INTRODUCTION AND DESCRIPTIVE STATISTICS.

THE SCIENCE OF STATISTICS.

STATISTICS: Is the science of data. This involves collecting, classifying, summarizing, organizing, analyzing and interpreting numerical information.

TYPES OF STATISTICAL APPLICATIONS.
  Descriptive Statistics
  Inferential Statistics

DESCRIPTIVE STATISTICS: Utilizes numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in the data set and to present that information in a convenient form.
  Collect Data: Ex. Survey
  Present Data: Ex. Tables and Graphs
  Characterize Data: Ex. Sample Mean, $\bar{x} = \frac{\sum x_i}{n}$

INFERENTIAL STATISTICS: Utilizes sample data to make estimates, decisions, predictions or other generalizations about a larger set of data.
  Estimation: Ex. Estimate the population mean weight using the sample mean weight
  Hypothesis Testing: Ex. Test the claim that the population mean weight is 120 pounds
  Drawing conclusions and/or making decisions concerning a population based on sample results.

FUNDAMENTAL ELEMENTS OF STATISTICS.

A POPULATION: Is a set of units in which we are interested. Typically, there are too many experimental units in a population to consider every one. If we can examine every single one, we conduct a CENSUS.

A SAMPLE: Is a subset of the POPULATION.

AN EXPERIMENTAL UNIT: Is an object about which we collect data. Ex. Person, Place, Thing, Event...

L1-Statistics Lecture Notes by Dr. Mahabi Compiled by Ibrahim Nyirenda, 2021-1

A VARIABLE: Is a characteristic or property of an individual unit. The values of this characteristic vary from unit to unit.

A MEASURE OF RELIABILITY: Is a statement about the degree of uncertainty associated with a statistical inference. Ex. Based on our analysis, we estimate that 56% of soda drinkers prefer Pepsi to Coke, ± 5%.
DESCRIPTIVE STATISTICS                       | INFERENTIAL STATISTICS
The population or sample of interest         | Population of interest
One or more variables to be investigated     | One or more variables to be investigated
Tables, graphs or numerical summary tools    | The sample of population units
Identification of patterns in the data       | The inference about the population based on the sample data
                                             | A measure of reliability of the inference

TYPES OF DATA.
  Quantitative Data
  Categorical (Qualitative) Data

QUANTITATIVE DATA: Are measurements that are recorded on a naturally occurring numerical scale. Ex. Age, GPA, Salary, Cost of books this semester...

CATEGORICAL (QUALITATIVE) DATA: Are measurements that cannot be recorded on a natural numerical scale, but are recorded in categories. Ex. Live On/Off campus, Major, Gender...

METHODS FOR DESCRIBING SETS OF DATA.

QUANTITATIVE DATA PRESENTATION:

1. ORDERED ARRAY:
  Organizes data to focus on major features
  Data placed in rank order from Smallest to Largest
  Example:
  Data in Raw Form (as Collected) are: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38...
  Data in Ordered Array are: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41...

2. STEM-AND-LEAF DISPLAY:
  Shows the number of observations that share a common value (the stem) and the precise value of each observation (the leaf)
  Example:
  Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41...

  STEM | LEAVES
  2    | 144677    From: 21, 24, 24, 26, 27, 27
  3    | 028       From: 30, 32, 38
  4    | 1         From: 41

3. FREQUENCY DISTRIBUTION TABLE:
  Determine Range
  Select Number of Classes, usually between 5 and 15 inclusive
  Compute Class Intervals (Width)
  Determine Class Boundaries (Limits)
  Compute Class Midpoints
  Count Observations and assign to Classes

Example-1:
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38

  CLASS          FREQUENCY
  15 but < 25        3
  25 but < 35        5
  35 but < 45        2

Example-2:
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
  CLASS          MIDPOINT   FREQUENCY
  15 but < 25       20          3
  25 but < 35       30          5
  35 but < 45       40          2

Where by:
  15 is the Lower Boundary (also called Limit) of the first class
  25 is the Upper Boundary (also called Limit) of the first class
  15 but < 25 is a Class Interval; its Width is 25 - 15 = 10
  Midpoint = (Lower Boundary + Upper Boundary) / 2

3.1 RELATIVE FREQUENCY DISTRIBUTION TABLE:
  class relative frequency = class frequency / n
  Where by: n is the sample size

  CLASS          RF (Prop.)
  15 but < 25       .3
  25 but < 35       .5
  35 but < 45       .2

3.2 RELATIVE FREQUENCY PERCENTAGE DISTRIBUTION TABLE:
  Class percentage = (Class relative frequency) x 100

  CLASS          RF%
  15 but < 25    30.0
  25 but < 35    50.0
  35 but < 45    20.0

3.3 CUMULATIVE RELATIVE FREQUENCY PERCENTAGE DISTRIBUTION TABLE:

  CLASS          CRF%
  15 but < 25     30.0    Initial value remains
  25 but < 35     80.0    30% + 50%
  35 but < 45    100.0    80% + 20%

3.4 HISTOGRAM: Is a graph of the frequency or relative frequency of a variable.
  Class intervals make up the horizontal axis (Lower Boundaries)
  The frequencies or relative frequencies are displayed on the vertical axis
  Example:

  CLASS          FREQUENCY
  15 but < 25        3
  25 but < 35        5
  35 but < 45        2

3.5 POLYGON:

  CLASS          MIDPOINT   FREQUENCY
  15 but < 25       20          3
  25 but < 35       30          5
  35 but < 45       40          2

  Where by:
  Midpoints make up the horizontal axis
  The frequencies or relative frequencies are displayed on the vertical axis

3.6 OGIVE:
  Class intervals make up the horizontal axis (Lower Boundaries)
  The cumulative relative frequencies (%) are displayed on the vertical axis

CATEGORICAL (QUALITATIVE) DATA PRESENTATION:

1.
SUMMARY TABLE:
  Lists categories and the number of elements in each category
  The number of elements in a category is obtained by tallying the responses in that category
  The summary may show frequencies (counts), percentages or both
  Example:

  MAJOR         COUNTS
  Accounts        130
  Economics        20
  Management       50
  Total           200

1.1 PIE CHART:
  Shows the breakdown of a total quantity into categories
  Useful for showing relative differences
  Angle Size = (360°)(Percentage)
  Therefore, using the table above (20/200 = 10%, 50/200 = 25%, 130/200 = 65%):
  360° x 10% = 36° for Economics
  360° x 25% = 90° for Management
  360° x 65% = 234° for Accounts

1.2 BAR CHART:
  Frequencies (or %) are displayed on the horizontal axis (Bars' length)
  Majors are displayed on the vertical axis
  Horizontal Bars are used for categorical variables
  The distance between Bars should be 1/2 to 1 Bar width
  The chart shall have equal Bar widths
  Horizontal Bar Chart Example: Plot a Bar Chart using the table hereunder

  MAJOR         COUNTS
  Accounts        130
  Economics        20
  Management       50
  Total           200

  Solution: [Figure: horizontal bar chart of counts by major]

1.3 PARETO DIAGRAM:
  Percentages are displayed on the vertical axis (Bars' length)
  Majors are displayed on the horizontal axis
  The chart shall have equal Bar widths
  Bars are to be in descending order (Largest to Smallest)
  Vertical Bar Chart Example: [Figure: Pareto diagram]
  NOTE: The diagram does not correspond to the previous Data

NUMERICAL DESCRIPTIVE MEASURES.

SUMMARY MEASURES:

  CENTRAL TENDENCY    | VARIATION
  Arithmetic Mean     | Range (Interquartile)
  Geometric Mean      | Variance
  Median              | Coefficient of Variation
  Mode                | Standard Deviation

1. MEASURES OF CENTRAL TENDENCY:
There are various ways to describe the central, most common or middle value in a distribution or set of data, such as:
  The arithmetic mean
  The geometric mean
  The median
  The mode
Numerical measures of Central Tendency summarize Data sets numerically. They help answer questions such as:
  Are there certain values that seem more typical for the data?
  How typical are they?
Therefore: Central Tendency is the value or values around which the data tend to cluster, while Variability shows how strongly the data are clustered around those values.

1.1 THE MEAN: Of a set of quantitative data is the sum of the observed values divided by the number of values (sample size).

  Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
  Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

  Where by: The sample mean is typically denoted by x-bar ($\bar{x}$), but the population mean is denoted by the Greek symbol μ.
  n: Sample size
  N: Population size

  Example: If x1 = 1; x2 = 2; x3 = 3 and x4 = 4, find the mean.
  Solution: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ = (1 + 2 + 3 + 4)/4 = 10/4 = 2.5

1.2 THE MEDIAN (M): Of a set of quantitative data is the value which is located in the middle of the data, arranged from lowest to highest values (or vice versa), with 50% of the observations above and 50% below.
  In order to find the Median (M):
  Arrange the n measurements from smallest to largest
  If n is odd, M is the middle number
  If n is even, M is the average of the middle two numbers

1.3 THE MODE: Is the most frequently observed value. For grouped data, the modal class is the class with the highest relative frequency, and its midpoint is taken as the mode.

1.4 THE GEOMETRIC MEAN: Equals the nth root of the product of all observations or values.
  For a set of values x1, x2, x3, ..., xn, the geometric mean equals:
  $GM = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} = (x_1 \cdot x_2 \cdots x_n)^{1/n}$

Example: Jim has 20 problems to do for homework. Some are harder than others and take more time to solve. We take a random sample of 9 problems. Find the mean (arithmetic and geometric), median and mode for the number of minutes Jim spends on his homework.

  Problem #   Time spent (Minutes)
  01             12
  02              4
  03              3
  04              8
  05              7
  06              5
  07              4
  08              9
  09             11

Data: Sample size (n) = 9
Problems 1 through 9 = x1, x2, x3 … x9, respectively

Solution-1, arithmetic mean (AM):
  $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ = (12 + 4 + 3 + 8 + 7 + 5 + 4 + 9 + 11)/9 = 63/9 = 7 minutes.

Solution-2, geometric mean (GM):
  $GM = (12 \cdot 4 \cdot 3 \cdot 8 \cdot 7 \cdot 5 \cdot 4 \cdot 9 \cdot 11)^{1/9} \approx 6.31$ minutes.

Solution-3, median (M):
  Arrange the n measurements from smallest to largest as follows: 3, 4, 4, 5, 7, 8, 9, 11, 12
  Position of the median = (n+1)/2 = (9+1)/2 = 5
  The 5th ordered observation is 7, and so the Median is 7.

Solution-4, mode:
  Arrange the n measurements from smallest to largest as follows: 3, 4, 4, 5, 7, 8, 9, 11, 12
  Only the value 4 occurs more than once. Then, the Mode is 4.

1.5 APPROXIMATING THE MEAN FROM A FREQUENCY DISTRIBUTION:
Used when the only source of data is a frequency distribution.

  $\bar{x} \approx \frac{\sum_{j=1}^{c} m_j f_j}{n}$

  Where by:
  n = sample size
  c = number of classes in the frequency distribution
  mj = midpoint of the jth class
  fj = frequency of the jth class

  Example: Approximate the mean using the table hereunder.

  CLASS          MIDPOINT   FREQUENCY
  10 but < 20       15          3
  20 but < 30       25          6
  30 but < 40       35          5
  40 but < 50       45          4
  50 but < 60       55          2
  Total                        20

  Solution:
  $\bar{x} \approx$ ((15x3) + (25x6) + (35x5) + (45x4) + (55x2))/20 = (45 + 150 + 175 + 180 + 110)/20 = 660/20 = 33

CONCLUSION.
  If you have a perfectly symmetric data set: Then, Mean = Median = Mode
  If you have extremely high values in the data set: Then, Mean > Median > Mode (Rightward skewness)
  If you have extremely low values in the data set: Then, Mean < Median < Mode (Leftward skewness)
A data set is skewed if one tail of the distribution has more extreme observations than the other tail.
The mean, median and mode give us an idea of the central tendency, or where the “middle” of the data is. Variability gives us an idea of how spread out the data are around that middle.
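The central-tendency results for Jim's homework sample and the grouped-mean approximation can be cross-checked with a short script. This is only a sketch using Python's standard `statistics` module; the variable names and the helper `grouped_mean` are illustrative assumptions, not part of the lecture notes.

```python
import statistics

# Minutes Jim spent on each of the 9 sampled problems (homework example).
times = [12, 4, 3, 8, 7, 5, 4, 9, 11]

arithmetic_mean = statistics.mean(times)           # sum of values / n
geometric_mean = statistics.geometric_mean(times)  # n-th root of the product
median = statistics.median(times)                  # middle ordered value
mode = statistics.mode(times)                      # most frequent value


def grouped_mean(midpoints, frequencies):
    """Approximate the mean from class midpoints m_j and frequencies f_j.

    Hypothetical helper: implements sum(m_j * f_j) / n, with n = sum(f_j).
    """
    n = sum(frequencies)
    return sum(m * f for m, f in zip(midpoints, frequencies)) / n


# Frequency distribution from section 1.5 (midpoints 15..55).
approx_mean = grouped_mean([15, 25, 35, 45, 55], [3, 6, 5, 4, 2])

print("AM:", arithmetic_mean)
print("GM:", round(geometric_mean, 2))
print("Median:", median)
print("Mode:", mode)
print("Grouped mean:", approx_mean)
```

The geometric mean comes out below the arithmetic mean here, which is expected whenever the observations are positive and not all equal.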
2. MEASURES OF VARIATION:

NUMERICAL MEASURES OF VARIABILITY:

2.1 RANGE.
The range is equal to the largest measurement minus the smallest measurement.
  Easy to compute, but not very informative
  Considers only two observations (the smallest and largest)

2.1.1 QUARTILES:
  Quartiles split Ordered Data into 4 equal portions.
  Q1 and Q3 are measures of Non-Central Location; Q2 = the Median.
  Each Quartile has a position and a value
  With the data in an ordered array, the position of Qi is:

  Position of $Q_i = \frac{i(n+1)}{4}$

  Where by:
  n = sample size
  Qi is the value associated with that position in the ordered array

  Example 1: Given the following data in Ordered Array, find Q1
  Data in Ordered Array: 11 12 13 16 16 17 18 21 22
  Solution:
  Position of Q1 = 1(9+1)/4 = 2.5
  Q1 = (12 + 13)/2 = 12.5

  Example 2: Given data in Ordered Array: 3 4 4 5 7 8 9 11 12
  Find the 1st and 3rd Quartiles in the ordered observations above.
  Solution:
  Position of Q1 = 1(9+1)/4 = 2.5
  The 2.5th observation = (4+4)/2 = 4
  Position of Q3 = 3(9+1)/4 = 3 x (position of Q1) = 7.5
  The 7.5th observation = (9+11)/2 = 10

2.1.2 INTERQUARTILE RANGE (IQR):
  Is the difference between Q3 and Q1: IQR = Q3 - Q1
  Covers the middle 50% of the values, also known as the Midspread.
  Resistant to extreme values.

  Example 1: Given the following data in Ordered Array: 11 12 13 16 16 17 17 18 21
  Find the Interquartile range (IQR).
  Solution:
  Position of Q1 = 1(9+1)/4 = 2.5
  The 2.5th observation = (12+13)/2 = 12.5
  Position of Q3 = 3(9+1)/4 = 3 x (position of Q1) = 7.5
  The 7.5th observation = (17+18)/2 = 17.5
  Therefore: The IQR = 17.5 - 12.5 = 5

  Example 2: Given the following data in Ordered Array: 3 4 4 5 7 8 9 11 12
  Find the Range and the Interquartile Range in the above distribution.
  Solution:
  Range = Largest - Smallest = 12 - 3 = 9
  Quartiles are as follows:
  Position of Q1 = 1(9+1)/4 = 2.5
  The 2.5th observation = (4+4)/2 = 4
  Position of Q3 = 3(9+1)/4 = 3 x (position of Q1) = 7.5
  The 7.5th observation = (9+11)/2 = 10
  Therefore: The IQR = 10 - 4 = 6

2.2 VARIANCE.

2.2.1 SAMPLE VARIANCE (s²): For a sample of n measurements is equal to the sum of the squared distances from the mean, divided by (n - 1).

  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

  Where by:
  s² = sample variance
  x_i = the ith measurement, so (x_i - $\bar{x}$) is its distance from the mean
  n = sample size
  $\bar{x}$ = sample mean

2.2.2 SAMPLE STANDARD DEVIATION (s): For a sample of n measurements is equal to the square root of the sample variance.

  $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

  Example: Say a small data set consists of the measurements 1, 2 and 3
  Find the sample variance and sample standard deviation
  Solution: Compute the sample mean first:
  $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ = (1 + 2 + 3)/3 = 6/3 = 2
  Then compute the sample variance and sample standard deviation:
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(3-2)^2 + (2-2)^2 + (1-2)^2}{3 - 1} = \frac{1^2 + 0^2 + 1^2}{2} = \frac{2}{2} = 1$
  $s = \sqrt{s^2} = \sqrt{1} = 1$

  NOTE: Greek letters are used for populations and Roman letters for samples
  s² = sample variance
  s = sample standard deviation
  σ² = population variance
  σ = population standard deviation

2.2.3 COMPARING STANDARD DEVIATIONS:
  Greater s or σ = more dispersion of data

2.2.4 INTERPRETING THE STANDARD DEVIATION:
  Chebyshev's Rule
  The Empirical Rule
  Both tell us something about where the data will be relative to the mean.

2.2.4.1 CHEBYSHEV'S RULE:
  Valid for any data set
  For any number k > 1, at least (1 - 1/k²) x 100% of the observations will lie within k standard deviations of the mean

  k     k²     1/k²      1 - 1/k²
  2      4     .25       75%
  3      9     .11       89%
  4     16     .0625     93.75%

THE BIENAYME-CHEBYSHEV RULE:
  At least (≥) 75% of the observations must be contained within distances of 2 SD around the mean.
  At least (≥) 88.89% of the observations must be contained within distances of 3 SD around the mean.
  At least (≥) 93.75% of the observations must be contained within distances of 4 SD around the mean.

2.2.4.2 THE EMPIRICAL RULE:
  Useful for mound-shaped, symmetrical distributions
  If the distribution is not perfectly mounded and symmetrical, the values are approximations
  For a perfectly symmetrical and mound-shaped distribution:
  ~68% will be within the range: ($\bar{x} - s$, $\bar{x} + s$)
  ~95% will be within the range: ($\bar{x} - 2s$, $\bar{x} + 2s$)
  ~99.7% will be within the range: ($\bar{x} - 3s$, $\bar{x} + 3s$)

Example: Hummingbirds beat their wings in flight an average of 55 times per second. Assume the standard deviation is 10, and that the distribution is symmetrical and mounded. Approximate what percentage of hummingbirds beat their wings:
  Between 45 and 65 times per second?
  Between 55 and 65 times per second?
  Less than 45 times per second?

Data:
  Sample Mean ($\bar{x}$) = 55
  Standard Deviation (s) = 10

Recall:
  ~68% will be within the range: ($\bar{x} - s$, $\bar{x} + s$)
  ~95% will be within the range: ($\bar{x} - 2s$, $\bar{x} + 2s$)
  ~99.7% will be within the range: ($\bar{x} - 3s$, $\bar{x} + 3s$)

Solution 1: Between 45 and 65 times per second?
Since 45 and 65 are exactly one standard deviation below and above the mean, the empirical rule says that about 68% of the hummingbirds will be in this range.

Solution 2: Between 55 and 65 times per second?
This range of numbers is from the mean to one standard deviation above it, or one-half of the range in the previous question. Therefore, about one-half of 68%, or 34%, of the hummingbirds will be in this range.

Solution 3: Less than 45 times per second?
Half of the entire data set lies above the mean, and ~34% lies between 45 and 55 (between one standard deviation below the mean and the mean). Therefore, ~84% (~34% + 50%) are above 45, which means ~16% are below 45.

Exercise: A manufacturer of automobile batteries claims that the average length of life of its grade A battery is 60 months. However, the guarantee on this brand is for just 36 months. Suppose the standard deviation of the life length is known to be 10 months and the frequency distribution of the life-length data is known to be mound shaped.
  Approximate what percentage of the manufacturer's grade A batteries will last more than 50 months, assuming that the manufacturer's claim is true.
  Approximate what percentage of the manufacturer's batteries will last less than 40 months, assuming that the manufacturer's claim is true.
  Suppose your battery lasts 37 months. What could you infer about the manufacturer's claim?

Data:
  Mean ($\bar{x}$) = 60
  Standard Deviation (s) = 10

Solution 1: Percentage of grade A batteries that will last more than 50 months, assuming the claim is true.
Half of the entire data set lies above the mean, and ~34% lies between 50 and 60 (between one standard deviation below the mean and the mean). Therefore, ~84% (~34% + 50%) of the manufacturer's grade A batteries will last more than 50 months.

Solution 2: Percentage of batteries that will last less than 40 months, assuming the claim is true.
The required percentage equals 100% minus the sum of the 50% lying above the mean and the ~47.5% lying within two standard deviations below the mean (34% + 13.5% = 47.5%).
Therefore, 100% - (50% + 47.5%) = 2.5%
Conclusion: ~2.5% of the manufacturer's batteries will last less than 40 months

Solution 3: Suppose your battery lasts 37 months. What could you infer about the manufacturer's claim?
The value 37 lies between two and three standard deviations below the mean (i.e. between 40 and 30). If the claim is true, only ~2.5% of batteries last less than 40 months, from 100% - (50% + 47.5%), so the chance that a battery lasts at most 37 months is no more than ~2.5%. The manufacturer claimed that "the average length of life of its grade A battery is 60 months". Either a rare event with only a slim (~2.5% or less) chance has occurred, or the claim does not hold; the observation gives reason to doubt the claim.

2.3 COEFFICIENT OF VARIATION.
  Measure of Relative Variation
  Shows variation relative to the Mean
  Used to compare two or more sets of data measured in different units

  $CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$

  Where by: S = Sample Standard Deviation and $\bar{X}$ = Sample Mean

2.3.1 COMPARING COEFFICIENT OF VARIATION:

  Stock A:                          Stock B:
  Average price last year = $50     Average price last year = $100
  Standard deviation = $5           Standard deviation = $5

  $CV_A = \left(\frac{S}{\bar{X}}\right) \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$
  $CV_B = \left(\frac{S}{\bar{X}}\right) \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$

  Conclusion: Both stocks have the same standard deviation, but stock B is less variable relative to its price.

2.3.2 Z-SCORE: NUMERICAL MEASURES OF RELATIVE STANDING.
The z-score tells us how many standard deviations above or below the mean a particular measurement is.

  Sample z-score: $z = \frac{x - \bar{x}}{s}$
  Population z-score: $z = \frac{x - \mu}{\sigma}$

Example 1: Hummingbirds beat their wings in flight an average of 55 times per second. Assume the standard deviation is 10, and that the distribution is symmetrical and mounded. An individual hummingbird is measured with 75 beats per second. What is this bird's z-score?

Data:
  Sample Mean ($\bar{x}$) = 55
  Standard Deviation (s) = 10
  Measurement/Value (x) = 75
  z = ?
Solution:
  $z = \frac{75 - 55}{10} = 2.0$
Therefore, the value 75 is 2.0 standard deviations above the Mean.

Example 2: If the mean is 14.0 and the standard deviation is 3.0, what is the z-score for the value 18.5?

Data:
  Sample Mean ($\bar{x}$) = 14.0
  Standard Deviation (s) = 3.0
  Measurement/Value (x) = 18.5
  z = ?

Solution:
  $z = \frac{x - \bar{x}}{s} = \frac{18.5 - 14.0}{3.0} = 1.5$
Therefore, the value 18.5 is 1.5 standard deviations above the Mean.

NOTE:
  1. A negative z-score means that a value is less than the Mean.
  2. Z-scores are related to the empirical rule as follows: For a perfectly symmetrical and mound-shaped distribution,
     ~68% will have z-scores between -1 and 1
     ~95% will have z-scores between -2 and 2
     ~99.7% will have z-scores between -3 and 3

INTERPRETATION OF POINT #2 ABOVE:
Since ~95% of all the measurements will be within 2 standard deviations of the Mean, only ~5% will be more than 2 standard deviations from the Mean. About half of this 5% will be far below the mean, leaving only about 2.5% of the measurements at least 2 standard deviations above the mean.

2.3.3 METHODS FOR DETERMINING OUTLIERS:
An outlier is a measurement that is unusually large or small relative to the other values. There are three possible causes of an outlier:
  Observation, recording or data entry error
  Item is from a different population
  A rare, chance event
The outlier can be identified using the Box Plot (“Box-and-Whisker”).

2.3.3.1 THE BOX PLOT (“Box-and-Whisker”):
The box plot is a graph representing information about certain percentiles for a data set and can be used to identify outliers.
  5-number summary: Median, Q1, Q3, X smallest, X largest
  Box Plot: Graphical display of data using the 5-number summary

2.3.3.2 DISTRIBUTION SHAPES AND BOX PLOT:
  [Figure panel labels: Left-Skewed, Symmetric]
  [Figure panel label: Right-Skewed]

2.3.3.3 OUTLIERS AND Z-SCORES:
For mound-shaped, symmetrical data, the chance that a z-score is between -3 and +3 is over 99%. Therefore, any measurement with |z| > 3 is considered an outlier.

2.3.3.4 CORRELATION COEFFICIENT (r):
  It has no unit (Unit Free)
  Measures the strength of the linear relationship between 2 quantitative variables
  Ranges between -1 and 1, where by:
  The closer to -1, the stronger the negative linear relationship becomes
  The closer to 1, the stronger the positive linear relationship becomes
  The closer to 0, the weaker any linear relationship becomes
  Example: [Figure: scatter plots of data with various Correlation Coefficients (r)]
  A scattergram or scatter plot shows the relationship between two quantitative variables.

2.3.3.5 DISTORTING THE TRUTH WITH DECEPTIVE STATISTICS:
DISTORTIONS:
  Stretching the axis (and the truth)
  Is the average relevant? Mean, median or mode?
  Is the average relevant? What about the spread?

BASIC PROBABILITY.

WHY PROBABILITY.
The following situations provide examples of the role of uncertainty in our lives and in a business context:
  Investment counselors cannot be sure which of two stocks will deliver the better growth over the coming year.
  Engineers try to reduce the likelihood that a machine will break down.
  Marketers may be uncertain as to the effectiveness of an ad campaign or the eventual success of a new product.
  Product manufacturers and system designers need testing methods that will assess various aspects of reliability. Long lifetimes make testing time-consuming, so “accelerated” testing methods are needed.
  Inventory management.

BASIC CONCEPTS.

RANDOM EXPERIMENT: Is a process leading to at least two possible outcomes with uncertainty as to which will occur.
  Example:
  A coin is thrown
  A consumer is asked which of two products he or she prefers

SAMPLE SPACES: Is a collection of all possible outcomes.
  Example: Examine three fuses in sequence and note the result of each examination. The outcome for the entire experiment is any sequence of Ns and Ds of length 3.
  Sample space S = {NNN, NND, NDN, NDD, DNN, DND, DDN, DDD}

AN EVENT: Is any collection (subset) of outcomes contained in the sample space S. An event is said to be simple if it consists of exactly one outcome and compound if it consists of more than one outcome.

JOINT EVENT: Is when 2 events occur simultaneously. Example: Male and Age over 20.

L2-Probability Theory Lecture Notes by Dr. Mahabi Compiled by Ibrahim Nyirenda, 2022-1

UNIONS AND INTERSECTION:
  Intersection: A and B, (A ∩ B)
  Union: A or B, (A ∪ B)

EVENT PROPERTIES:
  Mutually Exclusive: Two outcomes that cannot occur at the same time. Example: Flip a coin, resulting in head or tail.
  Collectively Exhaustive: At least one of the outcomes in the sample space must occur. Example: Male or Female.

SPECIAL EVENTS:
  Null Event: An event that cannot occur. Example: Club & Diamond on 1 Card Draw.
  Complement of Event: For Event A, all events not in A: A' or Ā.

WHAT IS PROBABILITY?
  Focuses on a systematic study of randomness and uncertainty.
  Provides methods for quantifying the chances, or likelihoods, associated with the various outcomes
  The numerical measure of the likelihood that an event will occur lies between 0 and 1, and the probabilities of all outcomes in the sample space sum to 1.
  0: Impossible
  1: Certain

CONCEPTS OF PROBABILITY.
  A priori (classical) probability: The probability of success is based on prior knowledge of the process involved. Example: The chance of picking a black card from a deck of cards.
  Empirical: The outcomes are based on observed data, not on prior knowledge of a process. Example: The chance that an individual selected at random from an employee survey is satisfied with his or her job.
  Classical probability: Based on formal reasoning.
  Subjective probability: The chance of occurrence assigned to an event by a particular individual, based on his/her experience, personal opinion and analysis of a particular situation. Example: The chance that a newly designed style of mobile phone will be successful in the market.

COMPUTING PROBABILITIES.

  $P(E) = \frac{X}{T}$

  NOTE: This assumes that each of the outcomes in the sample space is equally likely to occur.
  Where by:
  P(E): Probability of an Event E.
  X: Number of event outcomes.
  T: Total number of possible outcomes in the sample space.

PRESENTING PROBABILITY AND SAMPLE SPACE.
  Listing
  Venn Diagram
  Tree Diagram
  Contingency Table

LISTING: S = {Head, Tail}

VENN DIAGRAM: Let A = aces; Let B = red cards

TREE DIAGRAM: [figure not reproduced in this transcript]

CONTINGENCY TABLE:

           Ace   Not Ace   Total
  Black     2      24       26
  Red       2      24       26
  Total     4      48       52

JOINT PROBABILITY USING CONTINGENCY TABLE:

  Event    B1            B2            Total
  A1       P(A1 ∩ B1)    P(A1 ∩ B2)    P(A1)
  A2       P(A2 ∩ B1)    P(A2 ∩ B2)    P(A2)
  Total    P(B1)         P(B2)         1

  Where by:
  P(A1 ∩ B1); P(A2 ∩ B1); P(A1 ∩ B2) and P(A2 ∩ B2) are Joint Probabilities.
  P(A1); P(A2); P(B1) and P(B2) are Marginal/Simple Probabilities.

COMPOUND PROBABILITY.

ADDITION RULE: Used to get Compound Probabilities for the Union of Events.
  P(A or B) = P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
  For Mutually Exclusive Events: P(A or B) = P(A ∪ B) = P(A) + P(B).
  For Probability of Complement: P(A) + P(Ā) = 1. So, P(Ā) = 1 - P(A).

  Example: A hamburger chain found that 75% of all customers use mustard, 80% use ketchup and 65% use both. What is the probability that a particular customer will use at least one of these?
  Given:
  Let A = Customers use mustard; P(A) = .75
  Let B = Customers use ketchup; P(B) = .80
  P(A ∩ B) = .65 (Both)
  P(A ∪ B) = ? (At least one of these = A or B)
  Solution:
  P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = .75 + .80 - .65 = .90

MULTIPLICATION RULE: Used to get Joint Probabilities for the Intersection of Events (Joint Events).
  P(A and B) = P(A ∩ B).
  P(A ∩ B) = P(A)*P(B|A) = P(B)*P(A|B).
  For Independent Events: P(A and B) = P(A ∩ B) = P(A)*P(B).

COMPUTING CONDITIONAL PROBABILITIES.
A conditional probability is the probability of one event, given that another event has occurred:

  $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$   The conditional probability of A given that B has occurred.
  $P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$   The conditional probability of B given that A has occurred.

  Where by:
  P(A and B) = Joint Probability of A and B
  P(A) = Marginal Probability of A
  P(B) = Marginal Probability of B

Example: Of the cars on a used car lot, 70% have air conditioning (AC) and 40% have a CD player (CD). 20% of the cars have both. What is the probability that a car has a CD player, given that it has AC?
  Given:
  Let A = Cars with AC; P(A) = .7
  Let B = Cars with CD; P(B) = .4
  P(A ∩ B) = .2
  P(B | A) = P(CD | AC) = ?
  Solution:
  $P(B \mid A) = P(CD \mid AC) = \frac{P(CD \text{ and } AC)}{P(AC)} = \frac{0.2}{0.7} \approx 0.2857$

  By using a Contingency Table:

  Event      CD     No CD   Total
  AC         .2      .5      .7
  No AC      .2      .1      .3
  Total      .4      .6     1.0

  $P(CD \mid AC) = \frac{P(CD \text{ and } AC)}{P(AC)} = \frac{0.2}{0.7} \approx 0.2857$

  Conclusion: The probability that a car has a CD player, given that it has AC, is about .2857, or 28.6%.

BAYES' THEOREM.
  Permits revising old probabilities based on new information.
  Prior Probability.
  New information.
  Application of Conditional Probability.
  Mutually Exclusive and Exhaustive Events.
Therefore, the computation of a Posterior Probability P(Ai | B) from given Prior Probabilities P(Ai) and Conditional Probabilities P(B | Ai) is as follows:

  Recall: $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$ and $P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$

  $P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)$ ..........Eqn 1.

  $P(A_i \mid B) = \frac{P(A_i) \cdot P(B \mid A_i)}{P(B)}$ ...........Eqn 2.

This is Bayes' Theorem.
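The conditional-probability definitions and Eqn 2 can be checked numerically on the used-car example. The sketch below stores the joint probabilities from the contingency table in a dictionary; the dictionary layout and the helper name `marginal` are assumptions made for illustration only.

```python
# Joint probabilities from the used-car contingency table.
# Keys are (air-conditioning status, CD-player status).
joint = {
    ("AC", "CD"): 0.2,
    ("AC", "No CD"): 0.5,
    ("No AC", "CD"): 0.2,
    ("No AC", "No CD"): 0.1,
}


def marginal(joint_probs, label, position):
    """Sum the joint probabilities whose key has `label` at `position`.

    Hypothetical helper: position 0 selects the AC status,
    position 1 selects the CD status.
    """
    return sum(p for key, p in joint_probs.items() if key[position] == label)


p_ac = marginal(joint, "AC", 0)  # marginal probability P(AC)
p_cd = marginal(joint, "CD", 1)  # marginal probability P(CD)

# Conditional probabilities: joint probability divided by the marginal
# probability of the conditioning event.
p_cd_given_ac = joint[("AC", "CD")] / p_ac
p_ac_given_cd = joint[("AC", "CD")] / p_cd

# Eqn 2 (Bayes' Theorem) recovers P(AC | CD) from P(AC), P(CD | AC) and P(CD).
p_ac_given_cd_bayes = p_ac * p_cd_given_ac / p_cd

print(f"P(CD | AC) = {p_cd_given_ac:.4f}")
print(f"P(AC | CD) = {p_ac_given_cd:.4f}")
print(f"P(AC | CD) via Eqn 2 = {p_ac_given_cd_bayes:.4f}")
```

Note that P(CD | AC) and P(AC | CD) differ even though they share the same joint probability, because each is divided by a different marginal.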
Generalized form of Bayes' Theorem ("Revised Probability"):
Given k Mutually Exclusive and Exhaustive Events A1, A2, …, Ak, and an observed event B, then:

  $P(B) = P(A_1) \cdot P(B \mid A_1) + P(A_2) \cdot P(B \mid A_2) + \cdots + P(A_k) \cdot P(B \mid A_k)$ ...........Eqn 3.

  $P(B) = \sum_{i=1}^{k} P(A_i) \cdot P(B \mid A_i)$

[Figure: Bayes' Theorem reference diagrams: Conditional Probability; Sample space and Interaction of Events]

Example 1: A drilling company has estimated a 40% chance of striking oil for their new well. A detailed test has been scheduled for more information. Historically, 60% of successful wells have had detailed tests, and 20% of unsuccessful wells have had detailed tests. Given that this well has been scheduled for a detailed test, what is the probability that the well will be successful?

  Given:
  Prior Probabilities are:
  Let S = successful well: P(S) = .4
  Let U = unsuccessful well: P(U) = .6
  Conditional Probabilities are:
  Let D = Detailed Test Event
  P(D | S) = .6
  P(D | U) = .2
  P(S | D) = ?

  Solution:
  $P(S \mid D) = \frac{P(D \mid S) P(S)}{P(D \mid S) P(S) + P(D \mid U) P(U)} = \frac{(0.6)(0.4)}{(0.6)(0.4) + (0.2)(0.6)} = \frac{0.24}{0.24 + 0.12} = 0.667$

  Conclusion: Given the detailed test, the revised probability of a successful well has risen to 0.667 from the original estimate of 0.4

  Using tabular form:

  Event              Prior Prob.   Conditional Prob.   Joint Prob.         Revised Prob.
  S (successful)        .4              .6             0.4 x 0.6 = 0.24    0.24/0.36 = 0.667
  U (unsuccessful)      .6              .2             0.6 x 0.2 = 0.12    0.12/0.36 = 0.333
  Total                 1.0                            0.36                1.0

Example 2: Fifty percent of borrowers repaid their loans. Out of those who repaid, 40% had a college degree. Ten percent of those who defaulted had a college degree. What is the probability that a randomly selected borrower who has a college degree will repay the loan?
Given: Let B1 = Repay; B2 = Default; A = College degree
P(B1) = .5; P(B2) = .5; P(A|B1) = .4; P(A|B2) = .1
P(B1|A) = ?
Solution:
$P(B_1 \mid A) = \frac{P(A \mid B_1)P(B_1)}{P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2)} = \frac{(.4)(.5)}{(.4)(.5) + (.1)(.5)} = \frac{.2}{.25} = .8$
Using tabular form:
Event Bi       Prior Prob. P(Bi)   Conditional Prob. P(A | Bi)   Joint Prob. P(BiA)    Revised Prob. P(Bi | A)
B1 (Repay)     .5                  .4                            0.5 x 0.4 = 0.20      0.20/0.25 = 0.8
B2 (Default)   .5                  .1                            0.5 x 0.1 = 0.05      0.05/0.25 = 0.2
Total          1.0                                               0.25                  1.0

PERMUTATION AND COMBINATION.
PERMUTATION:
Counting Rule 1: If any one of n different mutually exclusive and collectively exhaustive events can occur on each of r trials, the number of possible outcomes is equal to: $n \cdot n \cdots n = n^r$
Counting Rule 2: The number of ways that all n objects can be arranged in order is: $n(n-1)(n-2)\cdots(2)(1) = n!$, where n! is called n factorial and 0! is defined as 1.
Example: There are 20 candidates for three different mechanical engineer positions, E1, E2, and E3. How many different ways could you fill the positions?
Solution: 20 x 19 x 18 = 6840.
Counting Rule 3 "Permutation": The number of ways of arranging r objects selected from n objects in order is:
$P^n_r = \frac{n!}{(n-r)!}$
COMBINATION: The number of ways of selecting r objects from n objects, irrespective of the order, is equal to:
$C^n_r = \binom{n}{r} = \frac{n!}{r!(n-r)!}$
Example: Five sales Engineers will be hired from a group of 100 applicants. In how many ways (Combinations) can groups of 5 sales Engineers be selected?
Given: n = 100; r = 5
Solution: $C^n_r = \binom{100}{5} = \frac{100!}{5!(100-5)!} = 75{,}287{,}520$

RANDOM VARIABLE.
A random variable is a variable that assumes numerical values associated with the random outcome of an experiment, where one (and only one) numerical value is assigned to each sample point.
TYPES OF RANDOM VARIABLE.
- Discrete random variable
- Continuous random variable
A discrete random variable: Can assume a countable number of values ("obtained by counting"). It is a random variable that can take on only certain values along an interval, with the possible values having gaps between them. Ex: Number of steps to the top of a Tower.
Example 1: Count the number of Tails when two coins are tossed and give the Probability Distribution table.
Given: If H = Head and T = Tail, S = {HH, HT, TH, TT}
Solution: Probability Distribution (Event: Toss two coins)
Outcomes   Value (number of Tails)   Probability
HH         0                         1/4 = .25
HT; TH     1                         2/4 = .50
TT         2                         1/4 = .25
Example 2: Six batches of components are ready to be shipped by a supplier. The number of defective components in each batch is as follows:
Batch             #1   #2   #3   #4   #5   #6
# of Defectives   0    2    0    1    2    0
Solution:
P(0) = P(Batch 1, 3 and 6) = 3/6 = 0.500
P(1) = P(Batch 4) = 1/6 = 0.167
P(2) = P(Batch 2 and 5) = 2/6 = 0.333
A continuous random variable: Can assume any value along a given interval of a number line.
Example: The time a tourist stays at the top once s/he gets there; the exact temperature outside.

PROBABILITY DISTRIBUTIONS FOR DISCRETE RANDOM VARIABLES.
The Probability Distribution (Probability Mass Function) of a discrete random variable is a graph, table or formula that specifies the probability associated with each possible outcome the random variable can assume, i.e. [Xj, p(Xj)] pairs.
Where:
Xj = Value of random variable
p(Xj) = Probability associated with value
$p(x) \ge 0$ for all values of x and $\sum p(x) = 1$
Example: Say a random variable x follows this pattern: $p(x) = (.3)(.7)^{x-1}$ for x > 0.
x    p(x)     x    p(x)
1    .30      6    .05
2    .21      7    .04
3    .15      8    .02
4    .10      9    .02
5    .07      10   .01

EXPECTED VALUES OF DISCRETE RANDOM VARIABLES.
The mean, or expected value, of a discrete random variable is: $\mu = E(x) = \sum x\,p(x)$.
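The expected value formula can be applied to the batch example above (Example 2). The sketch below builds the probability mass function from the six batch counts and then sums x·p(x); exact fractions are used so that the check that the probabilities sum to 1 is not affected by rounding. Variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Number of defective components in each of the six batches (Example 2).
defectives = [0, 2, 0, 1, 2, 0]

# Probability mass function: relative frequency of each observed value.
counts = Counter(defectives)
pmf = {x: Fraction(c, len(defectives)) for x, c in sorted(counts.items())}

# A valid distribution needs p(x) >= 0 for all x and sum of p(x) equal to 1.
assert all(p >= 0 for p in pmf.values())
assert sum(pmf.values()) == 1

# Expected value: mu = sum of x * p(x).
mean = sum(x * p for x, p in pmf.items())

for x, p in pmf.items():
    print(f"P({x}) = {p} = {float(p):.3f}")
print(f"E(x) = {mean} = {float(mean):.3f}")
```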
The variance of a discrete random variable x is: $\sigma^2 = E[(x-\mu)^2] = \sum (x-\mu)^2 p(x)$.
The standard deviation of a discrete random variable x is: $\sigma = \sqrt{E[(x-\mu)^2]} = \sqrt{\sum (x-\mu)^2 p(x)}$.

IMPORTANT DISCRETE PROBABILITY DISTRIBUTIONS.
THE BINOMIAL DISTRIBUTION:
Properties of a Binomial Random Variable.
- n identical trials. Example: Flip a coin 3 times.
- Two outcomes: Success or Failure. Example: Heads and Tails.
- P(S) = p and P(F) = q = 1 – p. Example: P(H) = .5; P(T) = 1 - .5 = .5
- Trials are independent. Example: A head on flip i doesn’t change P(H) of flip i + 1.
- x is the number of Successes in n trials.
$P(x) = \binom{n}{x} p^x q^{n-x}$ ..........Eqn 1
Where:
$\binom{n}{x}$ : The number of ways of getting the desired results
$p^x$ : The probability of getting the required number of successes
$q^{n-x}$ : The probability of getting the required number of failures
$\binom{n}{x} = \frac{n!}{x!(n-x)!}$ ..........Eqn 2
Inserting Eqn 2 into Eqn 1:
$p(x) = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x}$
Example: What is the probability of one success in five observations if the probability of success is .1?
Given: x = 1, n = 5, and p = 0.1
Solution:
$P(x = 1) = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x} = \frac{5!}{1!(5-1)!}(0.1)^1 (1-0.1)^{5-1} = (5)(0.1)(0.9)^4 = 0.32805$
NOTE: A Binomial Random Variable also has:
Mean: $\mu = np$
Variance: $\sigma^2 = npq$
Standard Deviation: $\sigma = \sqrt{npq}$

POSSIBLE BINOMIAL DISTRIBUTION SETTINGS.
- A manufacturing plant labels items as either defective or acceptable.
- A firm bidding for contracts will either get a contract or not.
- A marketing research firm receives survey responses of “yes I will buy” or “no I will not”.
- New job applicants either accept the offer or reject it.

THE HYPER GEOMETRIC DISTRIBUTION.
Recall: In the Binomial situation, each trial was independent, e.g. drawing cards from a deck and replacing the drawn card each time.
Now: If the card is not replaced, each trial depends on the previous trial(s).
The Hyper geometric distribution can be used in this case.
Randomly draw n elements from a set of N elements, without replacement. Assume there are r successes and N-r failures in the N elements.
Therefore: The Hyper geometric random variable is the number of successes, x, drawn from the r available in the n selections.
$P(x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}}$
Where:
N = Total number of elements (Population size)
r = Number of successes in the N elements (Successes in the population)
n = Number of elements drawn (Sample size)
x = Number of successes in the n elements (Successes in the sample)
NOTE: The Hyper geometric distribution also has:
Mean: $\mu = \frac{nr}{N}$
Variance: $\sigma^2 = \frac{r(N-r)\,n(N-n)}{N^2(N-1)}$
Example: Three light bulbs were selected from ten. Of the ten, four were defective. What is the probability that two of the three selected are defective?
Given: N = 10; n = 3; r = 4 and x = 2.
Solution: Recall:
$P(x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}}$, so $P(2) = \frac{\binom{4}{2}\binom{10-4}{3-2}}{\binom{10}{3}} = .30$

THE POISSON DISTRIBUTION.
Evaluates the probability of a number (usually small) of occurrences out of many opportunities in a:
- Period of time
- Area
- Volume
- Weight
- Distance
- Other units of measurement
$P(x) = \frac{\lambda^x e^{-\lambda}}{x!}$
Where:
λ = Mean number of occurrences in the given unit of time, area, volume, etc.
e = 2.71828….
µ = λ; σ² = λ
x = Number of successes per unit
Example: Suppose the number x of cracks per concrete specimen for a particular type of cement mix has approximately a Poisson probability distribution. Furthermore, assume that the average number of cracks per specimen is 2.5. Find the probability that a randomly selected concrete specimen has exactly five cracks.
Given: λ = 2.5; e = 2.71828…. and x = 5
Solution:
$P(x = 5) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{2.5^5 e^{-2.5}}{5!} = 0.067$

COMPARISON.
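As a starting point for the comparison, the three probability functions defined above can be written as short Python functions and checked against the worked examples (one success in five binomial trials, two defective bulbs out of three, and five cracks per specimen). This is a sketch using only the standard library; the function names are illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def hypergeometric_pmf(x: int, N: int, r: int, n: int) -> float:
    """P(x) = C(r, x) * C(N - r, n - x) / C(N, n)."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)


def poisson_pmf(x: int, lam: float) -> float:
    """P(x) = lam^x * e^(-lam) / x!."""
    return lam**x * exp(-lam) / factorial(x)


# Worked examples from the notes.
print(f"Binomial,       x=1, n=5, p=0.1     : {binomial_pmf(1, 5, 0.1):.5f}")
print(f"Hypergeometric, x=2, N=10, r=4, n=3 : {hypergeometric_pmf(2, 10, 4, 3):.2f}")
print(f"Poisson,        x=5, lambda=2.5     : {poisson_pmf(5, 2.5):.3f}")
```

The same three functions can be called with a common set of inputs (using p = r/N for the binomial and λ = np for the Poisson) to see how close the approximations are to the exact hypergeometric value.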
The Poisson probability distribution is related to, and can be used to approximate, a binomial probability distribution when n is large and λ = np is small.
Exercise: An acceptance sampling plan selects 5 items from a population of 500 items, 16 of which are unacceptable. The lot is accepted if at most 2 of the sampled items are unacceptable. Compare the exact (hypergeometric) probability with both binomial and Poisson approximations.
Recall:
N = Total number of elements (Population size)
r = Number of successes in the N elements (Successes in the population)
n = Number of elements drawn (Sample size)
x = Number of successes in the n elements (Successes in the sample)
Given: N = 500; n = 5; r = 16 and x = 2.
Solution (Hyper geometric):
$P(2) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} = \frac{\binom{16}{2}\binom{484}{3}}{\binom{500}{5}} = \frac{(120)(18{,}779{,}684)}{255{,}244{,}687{,}600} \approx .0088$
Where: $\binom{r}{x} = \frac{r!}{x!(r-x)!}$
Comparison:
Poisson: λ = np = 5(16/500) = 0.16; e = 2.71828…. and x = 2
$P(2) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{(0.16)^2 e^{-0.16}}{2 \cdot 1} \approx .0109$
Binomial: x = 2; n = 5; p = r/N = 16/500 = .032 and q = 1 - .032 = .968
$P(2) = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x} = \frac{5 \cdot 4 \cdot 3!}{2 \cdot 3!}(0.032)^2 (0.968)^3 \approx .0093$
Distribution Type:   Hyper geometric   Poisson   Binomial
Probability:         .0088             .0109     .0093
Conclusion: The binomial and Poisson values are both close to the exact hyper geometric probability.

CONFIDENCE INTERVAL ESTIMATION.
Statistical inference consists of those methods used to make decisions or to draw conclusions about a population. These methods utilize the information contained in a sample from the population in drawing conclusions.
Divided into two major areas:
- Parameter Estimation
- Hypothesis Testing
CONFIDENCE INTERVALS.
- Confidence Intervals for the Population Mean, μ:
  when Population Standard Deviation σ is Known
  when Population Standard Deviation σ is Unknown
- Confidence Intervals for the Population Proportion, p.
- Confidence Intervals for the Population Standard Deviation, σ.
- Determining the Required Sample Size.
Point and interval estimates: A Point Estimate is a single number. A Confidence Interval provides additional information about variability.
Point estimates:
We can estimate a Population Parameter   With a Sample Statistic (Point Estimate)
Mean                 μ                   $\bar{X}$
Proportion           π                   p
Standard Deviation   σ                   s
How much uncertainty is associated with a point estimate of a population parameter? An interval estimate provides more information about a population characteristic than a point estimate does. Such interval estimates are called Confidence Intervals.
L3-Confidence Intervals Lecture Notes by Dr. Mahabi Compiled by Ibrahim Nyirenda, 2022-1

CONFIDENCE INTERVAL ESTIMATE.
An interval gives a range of values:
- Takes into consideration variation in sample statistics from sample to sample;
- Based on observations from 1 sample;
- Gives information about closeness to unknown population parameters;
- Stated in terms of Level of Confidence;
- Can never be 100% confident.
[Figure: Estimation process]
General formula: The general formula for all confidence intervals is:
Point Estimate ± (Critical Value)(Standard Error) = $\bar{X} \pm Z \frac{\sigma}{\sqrt{n}}$

CONFIDENCE LEVEL (1 - α).
Confidence Level: Confidence with which the interval will contain the unknown population parameter. A percentage (less than 100%).
Suppose Confidence Level = 95%. Also written (1 - α) = 0.95, where α is a threshold that you use to categorize a result as either explainable by chance alone or not explainable by chance alone.
A relative frequency interpretation: In the long run, 95% of all the confidence intervals that can be constructed will contain the unknown true parameter.
A specific interval either will contain or will not contain the true parameter. No probability is involved in a specific interval.

CONFIDENCE INTERVAL FOR μ (σ Known).
Assumptions:
- Population standard deviation σ is known;
- Population is normally distributed;
- If Population is not normal, use large sample.
Confidence Interval Estimate: $\bar{X} \pm Z \frac{\sigma}{\sqrt{n}}$
Where:
$\bar{X}$ is the point estimate
Z is the normal distribution critical value for a probability of α/2 in each tail
$\sigma/\sqrt{n}$ is the standard error
Common Levels of Confidence: Commonly used confidence levels are 90%, 95%, and 99%.
Confidence Level   Confidence Coefficient (1 - α)   Z value
80%                0.80                             1.28
90%                0.90                             1.645
95%                0.95                             1.96
98%                0.98                             2.33
99%                0.99                             2.58
99.8%              0.998                            3.09
99.9%              0.999                            3.29
Finding the Critical Value (Z): Consider a 95% Confidence Interval; Z = ±1.96
Intervals and Level of Confidence (Sampling Distribution of the Mean): Intervals extend from $\bar{X} - Z \frac{\sigma}{\sqrt{n}}$ to $\bar{X} + Z \frac{\sigma}{\sqrt{n}}$. (1 - α) x 100% of intervals constructed contain μ; (α) x 100% do not.
Example: A sample of 11 circuits from a large normal population has a mean resistance of 2.20 ohms. We know from past testing that the population standard deviation is 0.35 ohms. Determine a 95% confidence interval for the true mean resistance of the population.
Given: $\bar{X}$ = 2.20; σ = 0.35; Z = 1.96; n = 11; μ = ?
Solution:
$\bar{X} \pm Z \frac{\sigma}{\sqrt{n}} = 2.20 \pm 1.96\,(0.35/\sqrt{11}) = 2.20 \pm 0.2068$, so $1.9932 \le \mu \le 2.4068$
Interpretation: We are 95% confident that the true mean resistance is between 1.9932 and 2.4068 ohms. Although the true mean may or may not be in this interval, 95% of intervals formed in this manner will contain the true mean.

CONFIDENCE INTERVAL FOR μ (σ Unknown).
If the population standard deviation σ is unknown, we can substitute the sample standard deviation S. This introduces extra uncertainty, since S is variable from sample to sample. So we use the t distribution instead of the normal distribution.
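Before working with the t distribution, the σ-known interval from the circuit-resistance example can be reproduced numerically. The sketch below uses `statistics.NormalDist` from the Python standard library to look up the critical Z value instead of a table; the helper name `z_interval` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(mean: float, sigma: float, n: int, confidence: float = 0.95):
    """Confidence interval for the mean when sigma is known: mean +/- Z * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value with alpha/2 in each tail
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin


# Circuit example: n = 11, sample mean 2.20 ohms, sigma = 0.35 ohms, 95% confidence.
low, high = z_interval(mean=2.20, sigma=0.35, n=11, confidence=0.95)
print(f"{low:.4f} <= mu <= {high:.4f}")
```

The computed critical value is slightly more precise than the tabulated 1.96, but the interval agrees with the worked solution to four decimal places.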
Assumptions:
- Population standard deviation is unknown
- Population is normally distributed
- If population is not normal, use large sample
- Use Student’s t Distribution
Confidence Interval Estimate: $\bar{X} \pm t_{n-1} \frac{S}{\sqrt{n}}$
Where:
$\bar{X}$ is the point estimate
t is the critical value of the t distribution with n - 1 degrees of freedom and an area of α/2 in each tail
$S/\sqrt{n}$ is the standard error
Student’s t Distribution:
- The t is a family of distributions
- The t value depends on degrees of freedom (df)
- d.f. = n - 1
Degrees of Freedom (df):
Idea: Number of observations that are free to vary after the sample mean has been calculated.
Example: Suppose the mean of 3 numbers is 8.0. Let X1 = 7; Let X2 = 8; What is X3?
Solution: If the mean of these three values is 8.0, then X3 must be 9; i.e. X3 is not free to vary.
Here, n = 3, so Degrees of Freedom: n – 1 = 3 – 1 = 2. Two values can be any numbers, but the third is not free to vary for a given mean.
NOTE: t → Z as n increases.
[Figure: Student’s t Distribution]
Example: Let: n = 3; df = n - 1 = 2; α = 0.10; α/2 = 0.05
Upper Tail Area:
df    .25     .10     .05
1     1.000   3.078   6.314
2     0.816   1.886   2.920
3     0.765   1.638   2.353
For df = 2 and an upper tail area of 0.05, t = 2.920.
NOTE: The body of the table contains t values, not probabilities.
t Distribution values, with comparison to the Z value:
Confidence Level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z
0.80               1.372         1.325         1.310         1.28
0.90               1.812         1.725         1.697         1.645
0.95               2.228         2.086         2.042         1.96
0.99               3.169         2.845         2.750         2.58
Example: A random sample of n = 25 has $\bar{X}$ = 50 and S = 8.
Form a 95% confidence interval for μ.
Given: n = 25; $\bar{X}$ = 50; S = 8; df = n - 1 = 25 - 1 = 24
95% confidence interval means: (1 - α) = 0.95, α = 1 - 0.95 = 0.05; α/2 = 0.025
$t_{\alpha/2,\,n-1} = t_{0.025,\,24} = 2.0639$
Solution:
$\bar{X} \pm t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} = 50 \pm (2.0639)\frac{8}{\sqrt{25}}$, so 46.698 ≤ μ ≤ 53.302

CONFIDENCE INTERVALS FOR THE POPULATION PROPORTION π.
An interval estimate for the population proportion (π) can be calculated by adding an allowance for uncertainty to the sample proportion (p).
Assumptions:
- Two categorical outcomes
- Population follows binomial distribution
- Normal approximation can be used if n·p > 5 and n·(1 - p) > 5
The distribution of the sample proportion is approximately normal if the sample size is large, with standard deviation:
$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}$
We will estimate this with sample data: $\sqrt{\frac{p(1-p)}{n}}$
Confidence Interval Endpoints: Upper and lower confidence limits for the population proportion are calculated with the formula:
$p \pm Z\sqrt{\frac{p(1-p)}{n}}$
Where:
Z is the standard normal value for the level of confidence desired
p is the sample proportion
n is the sample size
Example: A random sample of 100 people shows that 25 are left-handed. Form a 95% confidence interval for the true proportion of left-handers.
Given: n = 100; p = 25/100 = 0.25; Z = 1.96
Solution:
$p \pm Z\sqrt{p(1-p)/n} = 0.25 \pm 1.96\sqrt{0.25(0.75)/100} = 0.25 \pm 1.96\,(0.0433)$, so $0.1651 \le \pi \le 0.3349$
Interpretation: We are 95% confident that the true percentage of left-handers in the population is between 16.51% and 33.49%. Although the interval from 0.1651 to 0.3349 may or may not contain the true proportion, 95% of intervals formed from samples of size 100 in this manner will contain the true proportion.

CONFIDENCE INTERVALS FOR VARIANCES AND STANDARD DEVIATIONS.
Use the chi-square distribution Table.
Confidence Intervals for Variances:
$\frac{(n-1)s^2}{\chi^2_{right}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{left}}$
Standard Deviations:
$\sqrt{\frac{(n-1)s^2}{\chi^2_{right}}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_{left}}}$
Example: Find the 95% confidence interval for the variance and standard deviation of the nicotine content of cigarettes manufactured if a sample of 20 cigarettes has a standard deviation of 1.6 milligrams.
Given: 95% confidence interval: α = 0.05, α/2 = 0.025; n = 20; S = 1.6
Find critical values for 0.025 and (1 - 0.025) = 0.975 with 19 degrees of freedom (d.f.)
So: 0.025 → 32.852 and 0.975 → 8.907 from the chi-square distribution Table.
Solution:
$\frac{(20-1)(1.6)^2}{32.852} < \sigma^2 < \frac{(20-1)(1.6)^2}{8.907}$, so $1.5 < \sigma^2 < 5.5$
Taking square roots: $1.2 < \sigma < 2.3$
One-Sided Confidence Bounds: Substitute Zα/2 or tα/2 with Zα or tα.
Confidence Interval for a Difference in Mean:
General Distribution:
$(\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$; and for equal sample sizes with margin of error e, $n_1 = n_2 = \frac{Z_{\alpha/2}^2(\sigma_1^2 + \sigma_2^2)}{e^2}$
Where:
- Both samples are taken at random from the respective populations of interest.
- Samples are taken independently of each other.
- Both sample sizes are large enough to get an approximately normal distribution for the difference in sample means.
Example: A farm-equipment manufacturer wants to compare the average daily downtime for two sheet-metal stamping machines located in factories at two different locations. Investigation of company records for 100 randomly selected days on each of the two machines gave the following results:
Machine   Sample size   Mean   Variance
1         100           12     6
2         100           9      4
Construct a 90% confidence interval estimate for the difference in mean daily downtimes for sheet-metal stamping machines located at two locations.
Given: $\bar{X}_1$ = 12; $\bar{X}_2$ = 9; $\sigma_1^2$ = 6; $\sigma_2^2$ = 4; n1 = n2 = 100
90% confidence interval: α = 0.10, α/2 = 0.05, So: Zα/2 = Z0.05 = 1.645
$(\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = (12 - 9) \pm 1.645\sqrt{\frac{6}{100} + \frac{4}{100}} = 3 \pm 0.52$, i.e. 2.48 and 3.52
Interpretation: We are about 90% confident that the difference in mean daily downtimes for the sheet-metal stamping machines at the two locations is between 2.48 and 3.52 min.

DETERMINING SAMPLE SIZE.
- For the Mean
- For the Proportion
Sampling Error: The required sample size can be found to reach a desired margin of error (e) with a specified level of confidence (1 - α). The margin of error is also called sampling error:
- The amount of imprecision in the estimate of the population parameter
- The amount added and subtracted to the point estimate to form the confidence interval.
Determining Sample Size for the MEAN:
From $\bar{X} \pm Z\frac{\sigma}{\sqrt{n}}$, where $e = Z\frac{\sigma}{\sqrt{n}}$ (Sampling / Margin Error); Therefore, $n = \frac{Z^2\sigma^2}{e^2}$
NOTE: To determine the required sample size for the mean, one must know:
- The desired level of confidence (1 - α), which determines the critical Z value
- The acceptable sampling error, e
- The standard deviation, σ
Example: If σ = 45, what sample size is needed to estimate the mean within ± 5 with 90% confidence?
Given: σ = 45; e = 5
90% confidence interval: α = 0.10, α/2 = 0.05, So: Zα/2 = Z0.05 = 1.645
Solution: $n = \frac{Z^2\sigma^2}{e^2} = \frac{(1.645)^2(45)^2}{5^2} = 219.19$, So the required sample size is n = 220 (Always round up).
If σ is unknown: σ can be estimated when using the required sample size formula:
- Use a value for σ that is expected to be at least as large as the true σ
- Select a pilot sample and estimate σ with the sample standard deviation, S
Determining Sample Size for the PROPORTION:
$e = Z\sqrt{\frac{\pi(1-\pi)}{n}}$, so $n = \frac{Z^2\,\pi(1-\pi)}{e^2}$
NOTE: To determine the required sample size for the proportion, one must know:
- The desired level of confidence (1 - α), which determines the critical Z value
- The acceptable sampling error, e
- The true proportion of “successes”, π
π can be estimated with a pilot sample, if necessary (or conservatively use π = 0.5).
Example: How large a sample would be necessary to estimate the true proportion defective in a large population within ±3%, with 95% confidence? (Assume a pilot sample yields p = 0.12).
Given: For 95% confidence, Z = 1.96; e = 0.03; p = 0.12, so use this to estimate π
Solution: $n = \frac{Z^2\,\pi(1-\pi)}{e^2} = \frac{(1.96)^2(0.12)(1-0.12)}{(0.03)^2} = 450.75$, So use n = 451 (Always round up).
Ethical issues:
- A confidence interval estimate (reflecting sampling error) should always be included when reporting a point estimate;
- The level of confidence should always be reported;
- The sample size should be reported;
- An interpretation of the confidence interval estimate should also be provided.
Exercise: A laboratory scale is known to have a standard deviation of σ = 0.001 gram in repeated weighing. Scale readings in repeated weighing are Normally distributed, with mean equal to the true weight of the specimen. Three weighings of a specimen gave 3.412, 3.414, 3.415. Give a 95% confidence interval for the true weight of the specimen. What are the estimate and the margin of error in this interval?
How many weighings must be averaged to get a margin of error of 0.0005?
Given: σ = 0.001; n = 3; $\bar{X}$ = (3.412 + 3.414 + 3.415)/3 = 3.4137
For 95% confidence, Z = 1.96
Solution:
Recall: $\bar{X} \pm Z\frac{\sigma}{\sqrt{n}}$, where $e = Z\frac{\sigma}{\sqrt{n}}$ (Sampling / Margin Error); Therefore, $n = \frac{Z^2\sigma^2}{e^2}$
Part I: What are the estimate and the margin of error in this interval?
$e = Z\frac{\sigma}{\sqrt{n}} = 1.96\,\frac{0.001}{\sqrt{3}} = 1.13 \times 10^{-3} \approx 0.0011$ (Margin of error).
$\bar{X} \pm Z\frac{\sigma}{\sqrt{n}} = 3.4137 \pm 0.0011$, so $3.4125 \le \mu \le 3.4148$, with estimate $\bar{X}$ = 3.4137.
Part II: How many weighings must be averaged to get a margin of error of 0.0005?
σ = 0.001; Z = 1.96; e = 0.0005; n = ?
$n = \frac{Z^2\sigma^2}{e^2} = \frac{(1.96)^2(0.001)^2}{(0.0005)^2} = 15.37$, So the required sample size is n = 16.
For more examples, kindly visit:
http://www.ce.memphis.edu/3103/pdfs/Confidence%20Intervals_full.pdf
https://www.che.utah.edu/~tony/OTM/CI-CL/

FUNDAMENTALS OF HYPOTHESIS TESTING: ONE-SAMPLE TESTS.
Objectives:
- Structure engineering decision-making problems as hypothesis tests;
- Test hypotheses on the mean of a normal distribution using either a Z-test or a t-test procedure;
- Test hypotheses on the variance or standard deviation of a normal distribution;
- Test hypotheses on a population proportion;
- Use the P-value approach for making decisions in hypothesis tests;
- Compute power, type II error probability, and make sample size selection decisions for tests on means, variances, and proportions;
- Explain and use the relationship between confidence intervals and hypothesis tests.
What is a Hypothesis?
A statistical hypothesis is a claim (assumption) about a population parameter. It is a statement about the nature of a population. It is often stated in terms of a population parameter, e.g. the population mean or the population proportion.
Population mean example: The burning rate of a solid propellant used to power aircrew escape systems is μ = 50 cm/sec.
Population proportion example: The proportion of adults in this city with cell phones is π = 0.68.
The Null Hypothesis, H0:
- States the claim or assertion to be tested;
- Is always about a population parameter, e.g. H0: μ = 50, and not about a sample statistic.
- Determined by:
  Past experience or knowledge of the process or previous tests or experiments: changes
  Theory or model regarding the process: verification
  External considerations such as design or engineering specifications: conformance
- Begin with the assumption that the null hypothesis is true
- Similar to the notion of innocent until proven guilty
- Refers to the status quo
- Always contains the “=”, “≤” or “≥” sign
- May or may not be rejected
L4a- Hypothesis Testing-One Sample Lecture Notes by Dr. Mahabi Compiled by Ibrahim Nyirenda, 2022-1
The Alternative Hypothesis, H1:
- Is the opposite of the null hypothesis, e.g. the burning rate of a solid propellant used to power aircrew escape systems is not equal to 50 (H1: μ ≠ 50)
- Challenges the status quo
- Never contains the “=”, “≤” or “≥” sign
- May or may not be proven
- Is generally the hypothesis that the researcher is trying to prove
Hypothesis Testing Process:
1: Claim: The population mean age is 50. (Null Hypothesis: H0: μ = 50)
2: Now select a random sample from the Population.
3: Suppose the sample mean age is 20: $\bar{X}$ = 20.
4: Is $\bar{X}$ = 20 likely if μ = 50? If not likely, REJECT the Null Hypothesis.
Reason for Rejecting H0:
Level of Significance, α:
- Defines the unlikely values of the sample statistic if the null hypothesis is true.
- Defines the rejection region of the sampling distribution.
- Is designated by α (level of significance);
- Typical values are 0.01, 0.05, or 0.10;
- Is selected by the researcher at the beginning;
- Provides the critical value(s) of the test.
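The link between the level of significance and the critical value(s) can be illustrated for a Z test. The sketch below computes the critical values for the three typical α values listed above, using `statistics.NormalDist` in place of a normal table; the function name `z_critical` is an illustrative choice.

```python
from statistics import NormalDist


def z_critical(alpha: float, tails: int = 2) -> float:
    """Positive critical Z value: alpha/2 in each tail (two-tail) or alpha in one tail."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    return NormalDist().inv_cdf(1 - alpha / tails)


print("alpha   two-tail   one-tail")
for alpha in (0.01, 0.05, 0.10):
    print(f"{alpha:<7} {z_critical(alpha, 2):<10.3f} {z_critical(alpha, 1):.3f}")
```

For a two-tail test the rejection region lies beyond both the negative and the positive critical value; for a one-tail test it lies beyond a single critical value in the direction of H1.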
[Figure: Level of Significance and the Rejection Region]
Errors in Making Decisions:
Type I Error:
- Reject a true null hypothesis H0; considered a serious type of error.
- The probability of Type I Error is α
- Called the level of significance of the test
- Set by the researcher in advance
Type II Error:
- Fail to reject a false null hypothesis H0
- The probability of Type II Error is β
[Table: Outcomes and Probabilities: Possible Hypothesis Test Outcomes]
Type I & II Error Relationship:
- Type I and Type II errors cannot happen at the same time:
  Type I error can only occur if H0 is true
  Type II error can only occur if H0 is false
- If Type I error probability (α) ↑, then Type II error probability (β) ↓
Factors Affecting Type II Error: All else equal:
- β ↑ when the difference between the hypothesized parameter and its true value ↓
- β ↑ when α ↓
- β ↑ when σ ↑
- β ↑ when n ↓
Hypothesis Tests for the Mean:
Z Test of Hypothesis for the Mean (σ Known): Convert the sample statistic ($\bar{X}$) to a Z test statistic:
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
Hypothesis Testing Approaches: There are three basic approaches to conducting a hypothesis test:
1. Using a predetermined level of significance, establish critical value(s), then see whether the calculated test statistic falls into a rejection region for the test. (critical value).
2. Determine the exact level of significance associated with the calculated value of the test statistic. In this case, we’re identifying the most extreme critical value that the test statistic would be capable of exceeding. (p value).
3. Confidence Intervals.
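The factors affecting Type II error listed earlier in this section can be checked numerically for a two-tail Z test with σ known. The sketch below is an illustration under stated assumptions: it uses the standard normal-theory expression β = Φ(z − δ√n/σ) − Φ(−z − δ√n/σ), where z is the two-tail critical value and δ is the difference between the true and hypothesized means. That expression is not given in these notes, and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_error(delta: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """beta for a two-tail Z test when the true mean differs from the hypothesized mean by delta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # true difference measured in standard errors
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


# Hypothetical baseline: delta = 2, sigma = 2.5, n = 10, alpha = 0.05.
base = type_ii_error(delta=2.0, sigma=2.5, n=10, alpha=0.05)
smaller_difference = type_ii_error(delta=1.0, sigma=2.5, n=10, alpha=0.05)
smaller_alpha = type_ii_error(delta=2.0, sigma=2.5, n=10, alpha=0.01)
larger_sigma = type_ii_error(delta=2.0, sigma=4.0, n=10, alpha=0.05)
smaller_n = type_ii_error(delta=2.0, sigma=2.5, n=5, alpha=0.05)

print(f"baseline beta       : {base:.3f}")
print(f"smaller difference  : {smaller_difference:.3f}")
print(f"smaller alpha       : {smaller_alpha:.3f}")
print(f"larger sigma        : {larger_sigma:.3f}")
print(f"smaller n           : {smaller_n:.3f}")
```

Each of the four variations changes one input in the direction named in the list, and in each case the computed β is larger than the baseline.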
Critical Value Approach to Testing:
For a two-tail test for the mean, σ known:
- Convert the sample statistic ($\bar{X}$) to a test statistic (Z statistic);
- Determine the critical Z values for a specified level of significance α from a table or computer;
- Decision Rule: If the test statistic falls in the rejection region, Reject H0; otherwise do not Reject H0.
Two-Tail Tests: There are two cut off values (critical values), defining the regions of rejection.
6 Steps in Hypothesis Testing:
1. State the null hypothesis, H0, and the alternative hypothesis, H1;
2. Choose the level of significance, α, and the sample size, n;
3. Determine the appropriate test statistic and sampling distribution;
4. Determine the critical values that divide the rejection and non rejection regions;
5. Collect data and compute the value of the test statistic;
6. Make the statistical decision and state the managerial conclusion. If the test statistic falls into the non rejection region, do not reject the null hypothesis H0. If the test statistic falls into the rejection region, reject the null hypothesis H0. Express the managerial conclusion in the context of the problem.
Example: Suppose that we are interested in the mean burning rate of a solid propellant used to power aircrew escape systems. We are interested in deciding whether or not the mean burning rate is 50 centimeters per second. Suppose that a sample of n = 10 specimens is tested and that a sample mean burning rate of 48.5 is observed. Previous experience shows that the standard deviation of burning rate is 2.5.
Solution:
1. State the appropriate null and alternative hypotheses; H0: μ = 50; H1: μ ≠ 50 (This is a two-tail test)
2. Specify the desired level of significance and the sample size; Suppose that α = 0.05
3. Determine the appropriate technique; σ is known so this is a Z test.
4. Determine the critical values; For α = 0.05 the critical Z values are ±1.96
5. Collect the data and compute the test statistic; Suppose the sample results are: n = 10, $\bar{X}$ = 48.5, σ = 2.5 (is assumed known), μ = 50
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{48.5 - 50}{2.5/\sqrt{10}} = \frac{-1.5}{0.79} = -1.9$
6. Is the test statistic in the rejection region? Condition: Reject H0 if Z < -1.96 or Z > 1.96; otherwise do not reject H0.
Reach a decision and interpret the result: Since Z = -1.9 > -1.96, we do not reject the null hypothesis and conclude that there is not sufficient evidence that the mean burning rate differs from 50 centimeters per second.

p-Value Approach to Testing:
p-value: Probability of obtaining a test statistic at least as extreme (≤ or ≥) as the observed sample value, given H0 is true. Also called the observed level of significance; it is the smallest value of α for which H0 can be rejected.
- Convert the sample statistic ($\bar{X}$) to a test statistic (Z statistic);
- Obtain the p-value from a table or computer;
- Compare the p-value with α.
Decision rule: If p-value < α, Reject H0, but if p-value ≥ α, do not Reject H0.
Example: How likely is it to see a sample mean of 48.5 (or something further from the mean, in either direction) if the true mean is μ = 50?
Solution:
Convert the sample statistic ($\bar{X}$) to a test statistic (Z statistic). Recall:
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{48.5 - 50}{2.5/\sqrt{10}} = \frac{-1.5}{0.79} = -1.9$
Therefore: $\bar{X}$ = 48.5 is translated to a Z score of Z = -1.9
Obtain the p-value from a table or computer:
P(Z < -1.9) = 0.0287; P(Z > 1.9) = 0.0287, so p-value = 0.0287 + 0.0287 = 0.0574
Compare the p-value with α. If p-value < α, Reject H0, but if p-value ≥ α, do not Reject H0.
Since p-value = 0.0574 > α = 0.05, do not reject H0.
NOTE: Each tail area is 0.0287, which is greater than α/2 = 0.025.
Connection to Confidence Intervals:
For $\bar{X}$ = 48.5, σ = 2.5 and n = 10, the 95% confidence interval is:
$48.5 - (1.96)\frac{2.5}{\sqrt{10}}$ to $48.5 + (1.96)\frac{2.5}{\sqrt{10}}$, i.e. 46.9505 ≤ μ ≤ 50.0495
Since this interval contains the hypothesized mean "μ = 50", we do not reject the null hypothesis at α = 0.05.

One-Tail Tests:
In many cases, the alternative hypothesis focuses on a particular direction.
Upper-Tail Tests: There is only one critical value, since the rejection area is in only one tail.
Example: Upper-Tail Z Test for Mean (σ Known):
A phone industry manager thinks that customer monthly cell phone bills have increased, and now average over $52 per month. The company wishes to test this claim. (Assume σ = 10 is known).
Form hypothesis test:
H0: μ ≤ 52, the average is not over $52 per month;
H1: μ > 52, the average is greater than $52 per month (i.e. sufficient evidence exists to support the manager’s claim).
Solution: Suppose that α = 0.10 is chosen for this test.
Find the rejection region: Reject H0 if Z > 1.28 (the one-tail critical value for α = 0.10).
Test Statistic (Z): Obtain a sample and compute the test statistic. Suppose a sample is taken with the following results: n = 64, $\bar{X}$ = 53.1, μ = 52 and σ = 10 (assumed known). Then the test statistic is:
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{53.1 - 52}{10/\sqrt{64}} = 0.88$
Decision: Reach a decision and interpret the result: Do not reject H0 since Z = 0.88 ≤ 1.28, i.e. there is not sufficient evidence that the mean bill is over $52.
p-Value Solution: Calculate the p-value and compare to α (assuming that μ = 52.0).
$P(\bar{X} \ge 53.1) = P\left(Z \ge \frac{53.1 - 52.0}{10/\sqrt{64}}\right) = P(Z \ge 0.88) = 1 - 0.8106 = 0.1894$
Do not reject H0 since p-value = 0.1894 > α = 0.10

t Test of Hypothesis for the Mean (σ Unknown):
Convert the sample statistic ($\bar{X}$) to a t test statistic:
$t_{n-1} = \frac{\bar{X} - \mu}{S/\sqrt{n}}$
Example: Two-Tail Test (σ Unknown):
The average cost of a hotel room in New York is said to be $168 per night. A random sample of 25 hotels resulted in $\bar{X}$ = $172.50 and S = $15.40. Test at the α = 0.05 level. Assume the population distribution is normal.
Form hypothesis test:
H0: μ = 168 (Null Hypothesis)
H1: μ ≠ 168 (Alternative Hypothesis)
Given: n = 25, α = 0.05, S = 15.40, σ is unknown (use a t statistic).
Solution: Recall the t statistic:
$t_{n-1} = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{172.50 - 168}{15.40/\sqrt{25}} = 1.46$
Critical Value: t24 = ± 2.0639
Do not reject H0: there is not sufficient evidence that the true mean cost is different from $168.
Connection to Confidence Intervals:
For $\bar{X}$ = 172.5, S = 15.40 and n = 25, the 95% confidence interval is:
$172.5 - (2.0639)\frac{15.4}{\sqrt{25}}$ to $172.5 + (2.0639)\frac{15.4}{\sqrt{25}}$, i.e. 166.14 ≤ μ ≤ 178.86
Since this interval contains the hypothesized mean (168), we do not reject the null hypothesis at α = 0.05.

Hypothesis Tests for Proportions:
- Involves categorical variables
- Two possible outcomes:
  “Success” (possesses a certain characteristic)
  “Failure” (does not possess that characteristic)
- Fraction or proportion of the population in the “success” category is denoted by π.
Proportions: The sample proportion in the success category is denoted by p.
$p = \frac{X}{n} = \frac{\text{number of successes in sample}}{\text{sample size}}$
When both nπ and n(1-π) are at least 5, p can be approximated by a normal distribution with mean and standard deviation:
$\mu_p = \pi$; $\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}$
Example: Z Test for Proportion: A marketing company claims that it receives 8% responses from its mailing.
To test this claim, a random sample of 500 was surveyed, with 25 responses. Test at the α = 0.05 significance level.
Check: nπ = (500)(.08) = 40 and n(1-π) = (500)(.92) = 460, so both are at least 5.
Solution: H0: π = 0.08; H1: π ≠ 0.08. α = 0.05, n = 500, p = 25/500 = 0.05.
Critical Values: ±1.96
Test Statistic:
$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} = \frac{0.05 - 0.08}{\sqrt{\frac{0.08(1 - 0.08)}{500}}} = -2.47$
Decision: Reject H0 at α = 0.05, since Z = -2.47 < -1.96.
Conclusion: There is sufficient evidence to reject the company's claim of an 8% response rate.
p-Value Solution: Calculate the p-value and compare it to α (for a two-tail test the p-value is always two-tail):
$P(Z \le -2.47) + P(Z \ge 2.47) = 2(0.0068) = 0.0136$
Therefore, p-value = 0.0136.
Conclusion: Reject H0, since p-value = 0.0136 < α = 0.05.

Hypothesis Tests for Variance: Use the chi-square distribution.
Variances: the test statistic is
$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$
Example: Suppose a regulatory agency specifies that the standard deviation of the amount of fill in 16-ounce cans should be less than 0.1 ounce. To determine whether the process is meeting this specification, the supervisor randomly selects 10 cans and weighs the contents of each. The descriptive analysis showed that the cans have a mean of 16.026 and a standard deviation of 0.0412. Is there sufficient evidence to conclude that the true standard deviation σ of the fill measurements of 16-ounce cans is less than 0.1 ounce?
Solution: Testing the hypothesis: H0: σ² ≥ 0.01; H1: σ² < 0.01.
From the chi-square table with 9 degrees of freedom, the lower-tail critical value is 3.325, so H0 is rejected if the test statistic is less than 3.325. The calculated value is:
$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{9(0.0412)^2}{0.01} = 1.53$
Conclusion: Since the value of the test statistic (1.53 < 3.325) is within the rejection region, we reject the null hypothesis, and the supervisor can conclude that the variance of the population of all amounts of fill is less than 0.01.

Calculating Type II Error Probabilities β:
Type I error: rejecting a true null hypothesis.
α = Probability of rejecting H0 when H0 is true, α = P(reject H0 | H0 true).
α = The level of significance of a test.
Type II error: failing to reject a false null hypothesis.
β = Probability of failing to reject H0 when H0 is false, β = P(fail to reject H0 | H0 false).
1-β = Probability of rejecting H0 when H0 is false.
1-β = The power of the test (the probability that the test will respond correctly by rejecting a false null hypothesis).

Calculating Type II Error Probabilities β: To calculate P(Type II), or β:
1. Calculate the value(s) of $\bar{X}$ that divide the "do not reject" region from the "reject" region(s) (σ replaces s when the population standard deviation is known).
Upper-tailed test: $\bar{x}_0 = \mu_0 + z_\alpha \sigma_{\bar{x}} = \mu_0 + z_\alpha \frac{s}{\sqrt{n}}$
Lower-tailed test: $\bar{x}_0 = \mu_0 - z_\alpha \sigma_{\bar{x}} = \mu_0 - z_\alpha \frac{s}{\sqrt{n}}$
Two-tailed test: $\bar{x}_{0L} = \mu_0 - z_{\alpha/2} \sigma_{\bar{x}} = \mu_0 - z_{\alpha/2} \frac{s}{\sqrt{n}}$ and $\bar{x}_{0U} = \mu_0 + z_{\alpha/2} \sigma_{\bar{x}} = \mu_0 + z_{\alpha/2} \frac{s}{\sqrt{n}}$
2. Calculate the z-value of $\bar{x}_0$ assuming the alternative hypothesis mean is the true mean µ. The probability that $\bar{X}$ falls in the "do not reject" region, found from this z-value, is β.

Example 1: Oxford Cereals Company specifications require a mean weight of 368 grams per box, and the filling process is subject to periodic inspection by a representative of the consumer affairs office. The representative's job is to detect the possible "short weighting" of boxes, which means that cereal boxes having less than the specified 368 grams are sold. Thus, the representative is interested in determining whether there is evidence that the cereal boxes have a mean weight that is less than 368.
Suppose that a sample of 25 cereal boxes is selected at random, and the population standard deviation is 15 grams. Find the probability of making a Type II error and the power of the test if the actual population mean is 360 grams.
Solution: H0: µ ≥ 368 (filling process is working properly); Ha: µ < 368 (filling process is not working properly).
$\bar{x}_0 = \mu_0 - z_\alpha \frac{s}{\sqrt{n}} = 368 - (1.645)\frac{15}{\sqrt{25}} = 363.065$
$Z = \frac{\bar{x}_0 - \mu}{\sigma/\sqrt{n}} = \frac{363.065 - 360}{15/\sqrt{25}} = 1.02$
β = 1 - P(Z ≤ 1.02) = 1 - 0.8461 = 0.1539
Power of the test = 1 - β = 0.8461
In this example, a Type I error would be to conclude that the population mean fill was less than 368 when it was actually greater than or equal to 368. This error would result in adjusting the filling process even though the process was working properly. If you did not reject a false null hypothesis, you would make a Type II error and conclude that the population mean fill was greater than or equal to 368 when it actually was less than 368. Here, you would allow the process to continue without adjustment even though the process was not working properly.

Example 2: A textile fiber manufacturer is investigating a new drapery yarn, which the company claims has a mean thread elongation of 12 kilograms with a standard deviation of 0.5 kilograms. The company wishes to test the hypothesis H0: µ ≥ 12 against Ha: µ < 12 using a random sample of 16 specimens. Find β for the case where the true mean elongation is 11.25 kilograms, if the critical region is defined as $\bar{x}$ < 11.5 kilograms.
Solution:
$Z = \frac{\bar{x}_0 - \mu}{\sigma/\sqrt{n}} = \frac{11.5 - 11.25}{0.5/\sqrt{16}} = 2$
β = 1 - P(Z ≤ 2) = 1 - 0.9772 = 0.0228

Type II Error: In many practical problems, a specific value for the alternative will not be known, and consequently β cannot be calculated. Choose an appropriate significance level α and a test statistic that will make β as small as possible. Set up the hypotheses so that if the test statistic falls into the rejection region, we reject H0, knowing that the risk of a Type I error is fixed at α. If we do not reject, we state that the evidence is insufficient to reject H0. We do not affirmatively accept H0.

TWO-SAMPLE TESTS.
DIFFERENCE BETWEEN TWO MEANS:
Goal: Test a hypothesis or form a confidence interval for the difference between two population means, μ1 – μ2. The point estimate for the difference is $\bar{X}_1 - \bar{X}_2$.
Different data sources: Unrelated and independent. The sample selected from one population has no effect on the sample selected from the other population. Use the difference between the 2 sample means. Use a Z test, a pooled-variance t test or a separate-variance t test.
L4b- Hypothesis Testing-Two Samples Lecture Notes by Dr. Mahabi Compiled by Ibrahim Nyirenda, 2022-1
σ1 and σ2 Known:
Assumptions: Samples are randomly and independently drawn; population distributions are normal or both sample sizes are at least 30; population standard deviations are known.
When σ1 and σ2 are known and both populations are normal or both sample sizes are at least 30, the test statistic is a Z value. The standard error of $\bar{X}_1 - \bar{X}_2$ is:
$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
and the test statistic for μ1 – μ2 is:
$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
Hypothesis Tests for Two Population Means, Independent Samples: hypothesis tests for μ1 – μ2.
Confidence Interval, σ1 and σ2 Known: The confidence interval for μ1 – μ2 is:
$(\bar{X}_1 - \bar{X}_2) \pm Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

σ1 and σ2 Unknown, Assumed Equal:
Assumptions: Samples are randomly and independently drawn; population distributions are normal or both sample sizes are at least 30; population variances are unknown but assumed equal.
Forming interval estimates: The population variances are assumed equal, so use the two sample variances and pool them to estimate the common σ². The test statistic is a t value with (n1 + n2 – 2) degrees of freedom.
The pooled variance is:
$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}$
The test statistic for μ1 – μ2 is:
$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
where t has (n1 + n2 – 2) d.f. and $S_p^2$ is the pooled variance.
Confidence Interval, σ1 and σ2 Unknown: The confidence interval for μ1 – μ2 is:
$(\bar{X}_1 - \bar{X}_2) \pm t_{n_1 + n_2 - 2}\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
where $S_p^2$ is the pooled variance.

σ1 and σ2 Unknown, Not Assumed Equal:
Assumptions: Samples are randomly and independently drawn; population distributions are normal or both sample sizes are at least 30; population variances are unknown and cannot be assumed to be equal.
Forming the test statistic: The population variances are not assumed equal, so include the two sample variances in the computation of the t-test statistic. The test statistic is a t value with ν degrees of freedom, where ν is the integer portion of:
$\nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(S_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(S_2^2/n_2\right)^2}{n_2 - 1}}$
The test statistic for μ1 – μ2 is:
$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$

RELATED POPULATIONS:
Tests Means of 2 Related Populations: Paired or matched samples. Repeated measures (before/after). Use the difference between paired values: Di = X1i - X2i, where Di is called the ith paired difference.
Eliminates variation among subjects.
Assumptions: Both populations are normally distributed; or, if not normal, use large samples.
Mean Difference, σD Known: The point estimate for the population mean paired difference is:
$\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n}$
Suppose the population standard deviation of the difference scores, σD, is known, where n is the number of pairs in the paired sample. Then the test statistic for the mean difference is a Z value:
$Z = \frac{\bar{D} - \mu_D}{\sigma_D/\sqrt{n}}$
Whereby: μD = hypothesized mean difference; σD = population standard deviation of differences; n = the sample size (number of pairs).
Confidence Interval, σD Known: The confidence interval for μD is:
$\bar{D} \pm Z\frac{\sigma_D}{\sqrt{n}}$
Whereby: n = the sample size (number of pairs in the paired sample); σD = population standard deviation of differences; Z = critical value of the standard normal distribution for the chosen confidence level.
Mean Difference, σD Unknown: If σD is unknown, we can estimate the unknown population standard deviation with a sample standard deviation, SD. The sample standard deviation is:
$S_D = \sqrt{\frac{\sum_{i=1}^{n}(D_i - \bar{D})^2}{n - 1}}$
Use a paired t test: the test statistic for $\bar{D}$ is now a t statistic with (n - 1) d.f.:
$t = \frac{\bar{D} - \mu_D}{S_D/\sqrt{n}}$
Confidence Interval, σD Unknown: The confidence interval for μD is:
$\bar{D} \pm t_{n-1}\frac{S_D}{\sqrt{n}}$

Hypothesis Testing for Mean Difference, σD Unknown: Paired Samples:
Example: Paired t Test. Suppose we are interested in learning about the effect of a newly developed gasoline detergent additive on automobile mileage. To gather information, seven cars have been assembled, and their gasoline mileages (in units of miles per gallon) have been determined. For each car this determination is made both when gasoline without the additive is used and when gasoline with the additive is used.
The data can be represented as follows (mileage in miles per gallon):
Car   Without Additive   With Additive   Di
1     24.2               23.5             0.7
2     30.4               29.6             0.8
3     32.7               32.3             0.4
4     19.8               17.6             2.2
5     25.0               25.3            -0.3
6     24.9               25.4            -0.5
7     22.2               20.6             1.6
Solution:
$\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n} = 0.7$; $S_D = \sqrt{\frac{\sum_{i=1}^{n}(D_i - \bar{D})^2}{n - 1}} = 0.966$
Test, at the 5 percent level of significance, the null hypothesis that the additive does not change the mean number of miles obtained per gallon of gasoline (α = .05; α/2 = .025; $\bar{D}$ = .7; SD = .966 and d.f. = n - 1 = 6). Therefore, the critical values are ±2.447.
Form Hypothesis Test: H0: μD = 0; H1: μD ≠ 0.
Test Statistic:
$t = \frac{\bar{D} - \mu_D}{S_D/\sqrt{n}} = \frac{0.7 - 0}{0.966/\sqrt{7}} = 1.917$
Decision: Do not reject H0 (the t statistic, 1.917 < 2.447, is not in the rejection region).
Conclusion: There is not a significant change in the mileage.

TWO POPULATION PROPORTIONS:
Goal: Test a hypothesis or form a confidence interval for the difference between two population proportions, π1 – π2.
Assumptions: n1π1 ≥ 5, n1(1 - π1) ≥ 5, n2π2 ≥ 5, n2(1 - π2) ≥ 5.
The point estimate for the difference is p1 - p2.
Since we begin by assuming the null hypothesis is true, we assume π1 = π2 and pool the two sample estimates. The pooled estimate for the overall proportion is:
$\bar{p} = \frac{X_1 + X_2}{n_1 + n_2}$
Whereby: X1 and X2 are the numbers from samples 1 and 2 with the characteristic of interest.
The test statistic for p1 – p2 is a Z statistic:
$Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
Whereby: $\bar{p} = \frac{X_1 + X_2}{n_1 + n_2}$, $p_1 = \frac{X_1}{n_1}$, $p_2 = \frac{X_2}{n_2}$.
Confidence Interval for Two Population Proportions: The confidence interval for π1 – π2 is:
$(p_1 - p_2) \pm Z\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$
Hypothesis Tests for Two Population Proportions:
Example: Two Population Proportions: Is there a significant difference between the proportion of men and the proportion of women who will vote Yes on Proposition A?
In a random sample, 36 of 72 men and 31 of 50 women indicated they would vote Yes. Test at the .05 level of significance.
Solution: The hypothesis test is: H0: π1 – π2 = 0 (the two proportions are equal); H1: π1 – π2 ≠ 0 (there is a significant difference between proportions).
The sample proportions are: Men: p1 = 36/72 = .50; Women: p2 = 31/50 = .62.
The pooled estimate for the overall proportion is:
$\bar{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{36 + 31}{72 + 50} = \frac{67}{122} = .549$
For α = .05, Critical Values = ±1.96.
The test statistic for π1 – π2 is:
$z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(.50 - .62) - 0}{\sqrt{.549(1 - .549)\left(\frac{1}{72} + \frac{1}{50}\right)}} = -1.31$
Decision: Do not reject H0, since -1.96 < -1.31 < 1.96.
Conclusion: There is not significant evidence of a difference between men and women in the proportions who will vote Yes.

HYPOTHESIS TESTS FOR VARIANCES:
The F test statistic is:
$F = \frac{S_1^2}{S_2^2}$
Whereby: $S_1^2$ = variance of sample 1; $S_2^2$ = variance of sample 2; n1 - 1 = numerator degrees of freedom; n2 - 1 = denominator degrees of freedom.
The F Distribution: The F critical value is found from the F table. There are two appropriate degrees of freedom, numerator and denominator: df1 = n1 – 1; df2 = n2 – 1. In the F table, the numerator degrees of freedom determine the column and the denominator degrees of freedom determine the row.
Finding the Rejection Region: To find the critical F values: Find FU from the F table for (n1 – 1) numerator and (n2 – 1) denominator degrees of freedom. Find FL using the formula:
$F_L = \frac{1}{F_U^*}$
where FU* is from the F table with n2 – 1 numerator and n1 – 1 denominator degrees of freedom (i.e. switch the d.f. from FU).
Example: F Test: You are a financial analyst for a brokerage firm. You want to compare dividend yields between stocks listed on the NYSE and NASDAQ. You collect the following data:
          NYSE   NASDAQ
Number    21     25
Mean      3.27   2.53
Std dev   1.30   1.16
Is there a difference in the variances between the NYSE and NASDAQ at the α = 0.05 level?
Solution: Form the hypothesis test: H0: σ1² – σ2² = 0 (there is no difference between variances); H1: σ1² – σ2² ≠ 0 (there is a difference between variances).
Find the F critical values for α = 0.05, using α/2 = 0.025 in each tail with n1 – 1 = 20 numerator and n2 – 1 = 24 denominator degrees of freedom.
The test statistic is:
$F = \frac{S_1^2}{S_2^2} = \frac{1.30^2}{1.16^2} = 1.256$
Decision: F = 1.256 is not in the rejection region, so we do not reject H0.
Conclusion: There is not sufficient evidence of a difference in variances at α = .05.

ANALYSIS OF VARIANCE.
Terminologies:
Block: Group of homogeneous experimental units.
Design (layout): Complete specifications of experimental test runs, including blocking, randomization, repeat tests, replication, and the assignment of factor-level combinations to experimental units.
Effect: Change in the average response between two factor-level combinations or between two experimental conditions.
Factor: A controllable experimental variable that is thought to influence the response.
Level: Specific value of a factor.
Repeat tests: Two or more observations that have the same levels for all the factors.
Replication: Repetition of an entire experiment or a portion of an experiment under two or more sets of conditions.
Response: Outcome or result of an experiment or observation.
Example: Suppose that we are interested in comparing the yields per plot of different varieties of corn. Then the yield per plot is the response, the variety of corn is the factor, the different varieties of corn are the levels of this factor, and the plots are the experimental units.
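As a check on the NYSE/NASDAQ variance comparison in the F test example above, the steps can be sketched in Python. This is only an illustrative sketch: the function name `f_test_two_variances` is hypothetical, and the two table values passed in (2.33 and 2.41) are assumptions read from a standard F table at α/2 = 0.025; they are not stated in these notes.

```python
def f_test_two_variances(s1, n1, s2, n2, f_upper, f_upper_swapped):
    """Two-tail F test for the equality of two population variances.

    f_upper         : table value F_U with (n1-1, n2-1) degrees of freedom
    f_upper_swapped : table value F_U* with (n2-1, n1-1) degrees of freedom
    Both are table look-ups at alpha/2 supplied by the caller (assumed inputs).
    """
    f_stat = s1 ** 2 / s2 ** 2          # F = S1^2 / S2^2
    f_lower = 1.0 / f_upper_swapped     # F_L = 1 / F_U*
    reject = f_stat > f_upper or f_stat < f_lower
    return {
        "F": f_stat,
        "df": (n1 - 1, n2 - 1),
        "F_L": f_lower,
        "F_U": f_upper,
        "reject_H0": reject,
    }


# NYSE: s1 = 1.30, n1 = 21; NASDAQ: s2 = 1.16, n2 = 25.
# 2.33 and 2.41 are assumed F-table values for (20, 24) and (24, 20) d.f.
result = f_test_two_variances(1.30, 21, 1.16, 25,
                              f_upper=2.33, f_upper_swapped=2.41)
print(result)
```

With these assumed table values the statistic falls between F_L and F_U, which agrees with the decision reached in the example.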
General ANOVA Setting:
The investigator controls one or more independent variables, called factors (or treatment variables): characteristics which differentiate treatments/populations from one another; variables whose effect is of interest to the researcher. Each factor contains two or more levels (or groups or categories/classifications): the values of the factors utilized in the experiment.
L5-Analysis of Variance Lecture Notes by Dr. Mahabi Compiled by Ibrahim Nyirenda, 2022-1
Observe effects on the dependent variable: the response to the levels of the independent variable.
Experimental design: The plan used to collect the data.
Completely Randomized Design:
1. Experimental units (subjects) are assigned randomly to treatments: subjects are assumed homogeneous.
2. Only one factor or independent variable: with two or more treatment levels.
3. Analyzed by one-way analysis of variance (ANOVA).
One-Way Analysis of Variance:
Examples: Effects of five (levels) different brands (factor) of gasoline on automobile engine operating efficiency. Effects of the presence of four (levels) different sugar solutions (factor) on bacterial growth.
Assumptions: Populations are normally distributed; populations have equal variances; samples are randomly and independently drawn.
Hypotheses of One-Way ANOVA:
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_c$: All population means are equal, i.e. no treatment effect (no variation in means among groups).
H1: Not all of the population means are the same. At least one population mean is different, i.e. there is a treatment effect. This does not mean that all population means are different (some pairs may be the same).
One-Factor ANOVA: $H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_c$; $H_1$: Not all $\mu_j$ are the same.
All means are the same: the null hypothesis is true (no treatment effect).
At least one mean is different: the null hypothesis is NOT true (a treatment effect is present).
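To make the one-way ANOVA hypotheses concrete, the following Python sketch splits the total variation into an among-group part (SSA) and a within-group part (SSW) and forms the F ratio of their mean squares. The data are made up purely for illustration and the function name `one_way_anova` is hypothetical; the critical value would still have to be read from an F table with (c - 1, n - c) degrees of freedom.

```python
def one_way_anova(groups):
    """One-way ANOVA sums of squares, mean squares and F ratio.

    groups: list of lists, one list of observations per factor level.
    """
    c = len(groups)                          # number of groups (levels)
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Among-group variation: group means around the grand mean
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: observations around their own group mean
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Total variation: observations around the grand mean
    sst = sum((x - grand_mean) ** 2 for g in groups for x in g)

    msa = ssa / (c - 1)
    msw = ssw / (n - c)
    return {"SSA": ssa, "SSW": ssw, "SST": sst,
            "MSA": msa, "MSW": msw, "F": msa / msw,
            "df": (c - 1, n - c)}


# Hypothetical data: three groups of three observations each
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

For any data set the output should satisfy SST = SSA + SSW, which is a quick way to check a hand calculation.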
Partitioning the Variation: Total variation can be split into two parts: SST = SSA + SSW
Whereby: SST = Total Sum of Squares (total variation); SSA = Sum of Squares Among Groups (among-group variation); SSW = Sum of Squares Within Groups (within-group variation).
Total Variation = The aggregate dispersion of the individual data values across the various factor levels (SST).
Among-Group Variation = Dispersion between the factor sample means (SSA).
Within-Group Variation = Dispersion that exists among the data values within a particular factor level (SSW).

Total Sum of Squares: SST = SSA + SSW
$SST = \sum_{j=1}^{c}\sum_{i=1}^{n_j}(X_{ij} - \bar{\bar{X}})^2$
Whereby: SST = total sum of squares; c = number of groups (levels); $n_j$ = number of observations in group j; $X_{ij}$ = ith observation from group j; $\bar{\bar{X}}$ = grand mean (mean of all data values).
Total Variation: $SST = (X_{11} - \bar{\bar{X}})^2 + (X_{12} - \bar{\bar{X}})^2 + \dots + (X_{n_c c} - \bar{\bar{X}})^2$

Among-Group Variation:
$SSA = \sum_{j=1}^{c} n_j(\bar{X}_j - \bar{\bar{X}})^2 = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + \dots + n_c(\bar{X}_c - \bar{\bar{X}})^2$
Whereby: SSA = sum of squares among groups; c = number of groups; $n_j$ = sample size from group j; $\bar{X}_j$ = sample mean from group j; $\bar{\bar{X}}$ = grand mean (mean of all data values).
Mean Square Among: Mean Square Among = SSA/degrees of freedom: $MSA = \frac{SSA}{c - 1}$

Within-Group Variation:
$SSW = \sum_{j=1}^{c}\sum_{i=1}^{n_j}(X_{ij} - \bar{X}_j)^2 = (X_{11} - \bar{X}_1)^2 + (X_{12} - \bar{X}_2)^2 + \dots + (X_{n_c c} - \bar{X}_c)^2$
Whereby: SSW = sum of squares within groups; c = number of groups; $n_j$ = sample size from group j; $\bar{X}_j$ = sample mean from group j; $X_{ij}$ = ith observation in group j.
Mean Square Within: Mean Square Within = SSW/degrees of freedom: $MSW = \frac{SSW}{n - c}$

Obtaining the Mean Squares: $MSA = \frac{SSA}{c - 1}$; $MSW = \frac{SSW}{n - c}$; $MST = \frac{SST}{n - 1}$
One-Way ANOVA Table:
Source of Variation   SS    df      MS                  F
Among Groups          SSA   c - 1   MSA = SSA/(c - 1)   MSA/MSW
Within Groups         SSW   n - c   MSW = SSW/(n - c)
Total                 SST   n - 1
Whereby: c = number of groups; n = sum of the sample sizes from all groups; df = degrees of freedom.

ONE-WAY ANOVA F TEST STATISTIC.
H0: μ1 = μ2 = … = μc; H1: At least two population means are different.
Test statistic: $F = \frac{MSA}{MSW}$
Whereby: MSA is the mean square among groups; MSW is the mean square within groups.
Degrees of freedom: df1 = c – 1 (c = number of groups); df2 = n – c (n = sum of sample sizes from all populations).
Interpreting the One-Way ANOVA F Statistic: The F statistic is the ratio of the among estimate of variance to the within estimate of variance. The ratio must always be positive. df1 = c - 1 will typically be small; df2 = n - c will typically be large.
Decision Rule: Reject H0 if F > FU; otherwise do not reject H0.

Example: One-Way ANOVA F Test: An experiment was conducted to compare the wearing qualities of three types of paint when subjected to the abrasive action of a slowly rotating cloth-surfaced wheel. Ten paint specimens were tested for each paint type, and the number of hours until visible abrasion was apparent was recorded for each specimen. At the 0.05 significance level, is there sufficient evidence to indicate a difference in the mean time until abrasion is visibly evident for the three paint types?
Given: n1 = 10; n2 = 10; n3 = 10; n = 30; c = 3; α = 0.05
Solution:
Paint 1: 2296/10 = 229.6; therefore $\bar{x}_1 = 229.6$
Paint 2: 3099/10 = 309.9; therefore $\bar{x}_2 = 309.9$
Paint 3: 4278/10 = 427.8; therefore $\bar{x}_3 = 427.8$
Grand mean: (229.6 + 309.9 + 427.8)/3 = 322.4; therefore $\bar{\bar{x}} = 322.4$
1. Obtain Variation due to Factor (SSA), where d.f. = c - 1:
$SSA = \sum_{j=1}^{c} n_j(\bar{X}_j - \bar{\bar{X}})^2$
SSA = 10(229.6 – 322.4)² + 10(309.9 – 322.4)² + 10(427.8 – 322.4)² = 198772.5
2. Obtain Mean Square Among (MSA): Mean Square Among = SSA/degrees of freedom.
$MSA = \frac{SSA}{c - 1}$ = 198772.5/(3 - 1) = 99386.25
3. Obtain Variation due to Random Sampling (SSW), where d.f. = n - c:
$SSW = \sum_{j=1}^{c}\sum_{i=1}^{n_j}(X_{ij} - \bar{X}_j)^2$
SSW = (148 – 229.6)² + (76 – 229.6)² + … + (465 – 427.8)² = 770670.9
4. Obtain Mean Square Within (MSW): Mean Square Within = SSW/degrees of freedom.
$MSW = \frac{SSW}{n - c}$ = 770670.9/(30 - 3) = 28543.4
5. Obtain the F ratio or Test Statistic:
$F = \frac{MSA}{MSW} = \frac{99386.25}{28543.4} = 3.48$
Form Hypothesis: H0: μ1 = μ2 = … = μc; H1: μj not all equal.
Given: α = 0.05; df1 = 2; df2 = 27; FU = 3.35 from T-16 (F-Critical Values).
Decision: Reject H0 at α = 0.05, since F = 3.48 > FU = 3.35.
Conclusion: There is evidence that at least one μj differs from the rest.
ANOVA Table: Single Factor:
Source of Variation   SS         df   MS         F      F crit.
Between Groups        198772.5   2    99386.25   3.48   3.35
Within Groups         770670.9   27   28543.4
Total                 969443.4   29

Alternative method, one-way ANOVA:
$SST = (X_{11} - \bar{\bar{X}})^2 + (X_{12} - \bar{\bar{X}})^2 + \dots + (X_{n_c c} - \bar{\bar{X}})^2$ or $SST = \sum_{j=1}^{c}\sum_{i=1}^{n_j} X_{ij}^2 - \frac{1}{n}T^2$
Whereby: SST = total sum of squares; $\frac{1}{n}T^2$ = correction factor; $T^2$ = the square of the grand total.
$SSA = \sum_{j=1}^{c} n_j(\bar{X}_j - \bar{\bar{X}})^2$ or $SSA = \sum_{j=1}^{c}\frac{1}{n_j}T_j^2 - \frac{1}{n}T^2$
Whereby: SSA = sum of squares among groups; c = number of groups; $n_j$ = sample size from group j; $\bar{X}_j$ = sample mean from group j; $T_j$ = total of the observations in group j; $\bar{\bar{X}}$ = grand mean (mean of all data values).
Example: F Test. Suppose in an industrial experiment that an engineer is interested in how the mean absorption of moisture in concrete varies among 5 different concrete aggregates. The samples are exposed to moisture for 48 hours. It is decided that 6 samples are to be tested for each aggregate, requiring a total of 30 samples to be tested. Test the appropriate hypothesis at the 0.05 level of significance, using the totals given in the solution.
Given: c = 5; nj = 6; n = 30; α = 0.05
Solution: H0: μ1 = μ2 = μ3 = μ4 = μ5; Ha: at least two of the μj are unequal.
$\sum_{j=1}^{c}\sum_{i=1}^{n_j} X_{ij}^2 = 9677954$; $T = \sum_{j=1}^{c}\sum_{i=1}^{n_j} X_{ij} = 16854$
$\frac{1}{n}T^2$ = (16854)²/30 = 9468577.2
$SST = \sum_{j=1}^{c}\sum_{i=1}^{n_j} X_{ij}^2 - \frac{1}{n}T^2$ = 9677954 - 9468577.2 = 209376.8
$SSA = \sum_{j=1}^{c}\frac{1}{n_j}T_j^2 - \frac{1}{n}T^2 = \frac{1}{6}(3320)^2 + \frac{1}{6}(3416)^2 + \frac{1}{6}(3663)^2 + \frac{1}{6}(2791)^2 + \frac{1}{6}(3664)^2 - 9468577.2$
Therefore: SSA = 9553933.7 - 9468577.2 = 85356.467
SSW = SST - SSA = 209376.8 - 85356.467 = 124020.33
MSA = 85356.467/4 = 21339.12; MSW = 124020.33/25 = 4960.813; F = 21339.12/4960.813 = 4.3
ANOVA Table: Single Factor:
Source of Variation   SS          df   MS         F     F crit.
Between Groups        85356.467   4    21339.12   4.3   2.76
Within Groups         124020.33   25   4960.813
Total                 209376.8    29
Decision: Reject H0, since F = 4.3 > 2.76.
Conclusion: The aggregates do not have the same mean absorption.

THE TUKEY-KRAMER PROCEDURE.
Tells which population means are significantly different. Example: μ1 = μ2 ≠ μ3. Done after rejection of equal means in ANOVA. Allows pair-wise comparisons: compare the absolute mean differences with the critical range.
$\text{Critical Range} = Q_U\sqrt{\frac{MSW}{2}\left(\frac{1}{n_j} + \frac{1}{n_{j'}}\right)}$
Whereby: QU = value from the Studentized Range Distribution with c and (n - c) degrees of freedom for the desired level of α; MSW = Mean Square Within; nj and nj' = sample sizes from groups j and j'.
The critical range is compared with $|\bar{x}_{.1} - \bar{x}_{.2}|$, $|\bar{x}_{.1} - \bar{x}_{.3}|$, $|\bar{x}_{.2} - \bar{x}_{.3}|$, etc. Is $|\bar{x}_{.j} - \bar{x}_{.j'}|$ > Critical Range?
If the absolute mean difference is greater than the critical range, then there is a significant difference between that pair of means at the chosen level of significance.
Example: The Tukey-Kramer Procedure.
1. Compute Absolute Mean Differences:
Paint 1: 2296/10 = 229.6; therefore $\bar{x}_1 = 229.6$
Paint 2: 3099/10 = 309.9; therefore $\bar{x}_2 = 309.9$
Paint 3: 4278/10 = 427.8; therefore $\bar{x}_3 = 427.8$
$|\bar{x}_1 - \bar{x}_2| = |229.6 - 309.9| = 80.3$; $|\bar{x}_1 - \bar{x}_3| = |229.6 - 427.8| = 198.2$; $|\bar{x}_2 - \bar{x}_3| = |309.9 - 427.8| = 117.9$
2. Find the QU value from the table, given c = 3; n = 30; α = 0.05: n - c = 30 - 3 = 27; therefore QU = 3.53
3. Compute Critical Range:
$\text{Critical Range} = Q_U\sqrt{\frac{MSW}{2}\left(\frac{1}{n_j} + \frac{1}{n_{j'}}\right)} = 3.53\sqrt{\frac{28543.4}{2}\left(\frac{1}{10} + \frac{1}{10}\right)} = 188.6$
4. Compare each absolute mean difference with the critical range of 188.6: $|\bar{x}_1 - \bar{x}_2| = 80.3 < 188.6$; $|\bar{x}_1 - \bar{x}_3| = 198.2 > 188.6$; $|\bar{x}_2 - \bar{x}_3| = 117.9 < 188.6$
5. Decision: Only one of the absolute mean differences, $|\bar{x}_1 - \bar{x}_3|$, is greater than the critical range. Therefore there is a significant difference between that one pair of means at the 5% level of significance.
6. Conclusion: With 95% confidence we can conclude that the mean time until visible abrasion for paint 3 is greater than for paint 1.

THE RANDOMIZED BLOCK DESIGN.
Like one-way ANOVA, we test for equal population means (for different factor levels), but we want to control for possible variation from a second factor (with two or more levels). Levels of the secondary factor are called blocks.
The randomized block design consists of a two-step procedure:
Matched sets of experimental units, called blocks, are formed, each block consisting of p experimental units (where p is the number of treatments). The blocks should consist of experimental units that are as similar as possible.
One experimental unit from each block is randomly assigned to each treatment, resulting in a total of n = bp responses (where b is the number of blocks).
Examples: Testing the tensile strength of wires produced using different machines, testing different methods of production using various operators, testing different brands of tires for different passenger cars, testing different teaching methods, or testing a certain number of drugs on a group of animals.
In these examples, the different blocks consist of machines, operators, cars, students, and animals, respectively.
Partitioning the Variation: Total variation can now be split into three parts: SST = SSA + SSBL + SSE
Whereby: SST = total variation; SSA = among-group variation; SSBL = among-block variation; SSE = random variation.
Sum of Squares for Blocking:
$SSBL = c\sum_{i=1}^{r}(\bar{X}_{i.} - \bar{\bar{X}})^2$
Whereby: c = number of groups; r = number of blocks; $\bar{X}_{i.}$ = mean of all values in block i; $\bar{\bar{X}}$ = grand mean (mean of all data values).
SST and SSA are computed as they were in one-way ANOVA, and SSE = SST – (SSA + SSBL).
Mean Squares:
MSBL = Mean Square Blocking: $MSBL = \frac{SSBL}{r - 1}$
MSA = Mean Square Among Groups: $MSA = \frac{SSA}{c - 1}$
MSE = Mean Square Error: $MSE = \frac{SSE}{(r - 1)(c - 1)}$
Randomized Block ANOVA Table:
Source of Variance   SS     df               MS     F-ratio
Among Treatments     SSA    c - 1            MSA    MSA / MSE
Among Blocks         SSBL   r - 1            MSBL   MSBL / MSE
Error                SSE    (r - 1)(c - 1)   MSE
Total                SST    rc - 1
Whereby: c = number of populations; rc = sum of the sample sizes from all populations; r = number of blocks; df = degrees of freedom.
Blocking Test: $H_0: \mu_{1.} = \mu_{2.} = \mu_{3.} = \dots$; $H_1$: Not all block means are equal.
Blocking test: df1 = r – 1; df2 = (r – 1)(c – 1); F = MSBL / MSE; reject H0 if F > FU.
Main Factor Test: $H_0: \mu_{.1} = \mu_{.2} = \mu_{.3} = \dots$; $H_1$: Not all population means are equal.
Main Factor test: df1 = c – 1; df2 = (r – 1)(c – 1); F = MSA / MSE; reject H0 if F > FU.
Multiple comparison of means: Tukey-Kramer procedure: equal sample sizes. Bonferroni: does not require equal sample sizes. Scheffé: compares all possible linear combinations.
Apply your knowledge: A consumer testing organization wished to compare the annual power consumption for five different brands of dehumidifier. Because power consumption depends on the prevailing humidity level, it was decided to monitor each brand at four different levels ranging from moderate to heavy humidity (thus blocking on humidity level). Within each level, brands were randomly assigned to the five selected locations. The resulting amounts of power consumption (annual kWh) are:
Treatments (Brands)   Blocks (Humidity Level)
                      1     2     3     4
1                     685   792   838   875
2                     722   806   893   953
3                     733   802   880   941
4                     811   888   952   1005
5                     828   920   978   1023

EXPERIMENTAL DESIGN.
THE EXPERIMENTAL DESIGN PROCESS:
Design of Experiments (DOE) defined: A theory concerning the minimum number of experiments necessary to develop an empirical model of a research question, and a methodology for setting up the necessary experiments.
Design of Experiment Constraints: Time and money.
Why conduct experiments: To determine the principal causes of variation in a measured response; to find the conditions that give rise to a maximum or minimum response; to compare the responses achieved at different settings of controllable variables; to obtain a mathematical model in order to predict future responses.
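Returning to the dehumidifier exercise in the randomized block section above, the partition SST = SSA + SSBL + SSE can be sketched in Python as follows. This is a minimal sketch, not a prescribed solution: the function name `randomized_block_anova` and the layout of the input (one row per brand, one column per humidity level) are assumptions made for illustration.

```python
def randomized_block_anova(table):
    """Sums of squares and F ratios for a randomized block design.

    table[j][i] is the response for treatment j (row) in block i (column).
    """
    c = len(table)        # number of treatments (groups)
    r = len(table[0])     # number of blocks
    grand = sum(sum(row) for row in table) / (r * c)

    treat_means = [sum(row) / r for row in table]
    block_means = [sum(table[j][i] for j in range(c)) / c for i in range(r)]

    sst = sum((x - grand) ** 2 for row in table for x in row)
    ssa = r * sum((m - grand) ** 2 for m in treat_means)    # among treatments
    ssbl = c * sum((m - grand) ** 2 for m in block_means)   # among blocks
    sse = sst - (ssa + ssbl)                                # random variation

    msa = ssa / (c - 1)
    msbl = ssbl / (r - 1)
    mse = sse / ((r - 1) * (c - 1))
    return {"SST": sst, "SSA": ssa, "SSBL": ssbl, "SSE": sse,
            "F_treatments": msa / mse, "F_blocks": msbl / mse,
            "df_treatments": (c - 1, (r - 1) * (c - 1)),
            "df_blocks": (r - 1, (r - 1) * (c - 1))}


# Annual kWh: rows are brands 1-5, columns are humidity levels 1-4
power = [
    [685, 792, 838, 875],
    [722, 806, 893, 953],
    [733, 802, 880, 941],
    [811, 888, 952, 1005],
    [828, 920, 978, 1023],
]
result = randomized_block_anova(power)
print(result)
```

Each F ratio is then compared with the F table value FU for its pair of degrees of freedom, as in the main factor and blocking tests.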
Benefits of experimental design:
Design a proper set of experiments for measurement or simulation;
Develop a model that best describes the data obtained and check model adequacy;
Estimate the contribution of each alternative to the performance;
Isolate the measurement errors;
Estimate confidence intervals for model parameters;
Check if the alternatives are significantly different.

L6-Experimental Design Process Lecture Notes by Dr. Mahabi Compiled by Ibrahim Nyirenda, 2022-1

Common mistakes in experimentation:
The variation due to experimental error is ignored;
Important parameters are not controlled;
Effects of different factors are not isolated;
Simple one-factor-at-a-time designs are used;
Interactions are ignored: An interaction is the failure of one factor to produce the same effect on the response at different levels of another factor;
Too many experiments are conducted.

EXPERIMENTAL DESIGN BASICS:
Two kinds of data gathering methodologies:
Observation: Cannot prove cause and effect but can establish associations.
Experimental: Can prove cause and effect.
Variables of interest: Factors vs. Treatments.
Independent variable: Treatment: Manipulations of variables of interest; Treatment vs. Control group.
Dependent variable: What you are measuring.

Example: Optimize the various operating parameters for enhancing the performance and heat transfer characteristics of solar parabolic trough collectors (PTC). The independent variables are chosen as follows:
Parameters (Factors) | Values (Levels / Treatments)
Diameter of receiver (m) | 0.03 | 0.026 | 0.021
Mass flow rate (kg/s) | 0.001756 | 0.001578 | 0.001311
Material of receiver | Copper (Cu) | Aluminium (Al) | Galvanized steel (GI)

Response Variable: The outcome. Example: Performance, Throughput...
Factors: Variables that affect the response variable. Example: Diameter of receiver, Mass flow rate, Material of receiver. They are also called Predictor variables or Predictors.
Levels: The values that a factor can assume. Also called Treatments. Example: Mass flow rate has three levels: 0.001756, 0.001578 and 0.001311.
Primary Factors: The factors whose effects need to be quantified.
Confounds.
Randomization Concerns:
Randomization prevents experimental bias;
Assignment by experimenter: Counterbalancing;
Statistical assumptions: Randomization is a requirement for statistical tests of significance.

Design of Experiment Terminologies:
Replications: Independent observations of a single treatment.
Repeated measures: Each subject is measured at two or more points in time.
Variance: The measuring stick that compares different treatments.
Internal validity: The extent to which an experiment accomplishes its goal(s).
Reproducibility: Given the appropriate information, the ability of others to replicate the experiment.
External validity: How representative of the target population is the sample? Can the results be generalized? Generalizations from field experiments are easier to justify than those from lab experiments because of the artificialities of the lab.
Medical Trials: Placebo; Double Blind.

BASIC PRINCIPLES OF EXPERIMENTAL DESIGNS:
The Principle of Replication;
The Principle of Randomization;
The Principle of Local Control.

1. Principle of Replication:
The experiment should be repeated more than once;
Each treatment is applied to many experimental units instead of one;
This increases the statistical accuracy of the experiments;
The result so obtained will be more reliable.
Conceptually, replication does not present any difficulty, but computationally it does. Example: If an experiment requiring a two-way analysis of variance is replicated, it will then require a three-way analysis of variance, since replication itself may be a source of variation in the data.
However, it should be remembered that replication is introduced in order to increase the precision of a study.

2. Principle of Randomization:
Randomization protects an experiment against the effect of extraneous factors;
This principle indicates that you should design or plan the experiment in such a way that the variations caused by extraneous factors can all be combined under the general heading of "chance";
The application of the principle of randomization gives a better estimate of the experimental error.

3. Principle of Local Control (Blocking):
Under it, the extraneous factor (the known source of variability) is made to vary deliberately over as wide a range as necessary;
This needs to be done in such a way that the variability it causes can be measured and hence eliminated from the experimental error;
This means that you should plan the experiment so that you can perform a two-way analysis of variance in which the total variability of the data is divided into three components attributed to treatments (varieties of rice), the extraneous factor (soil fertility) and experimental error;
Blocking is a method of eliminating the effects of unrelated variation due to noise factors, thereby improving the efficiency of the experimental design;
The main objective is to eliminate unwanted sources of variability (batch to batch, day to day, shift to shift, etc.) by arranging similar experimental runs into blocks (or groups). Generally, a block is a set of relatively homogeneous experimental conditions;
The blocks can be batches of raw materials, different operators, different vendors, etc.;
Observations collected under the same experimental conditions (i.e. same day, same shift, etc.) are said to be in the same block;
Variability between blocks must be eliminated from the experimental error, which leads to an increase in the precision of the experiment.

FACTORIAL DESIGNS:
Full factorial design: Two or more independent variables are manipulated in a single experiment. They are referred to as factors.
Levels: These are the various ways in which the independent variable is changed.
The major purpose of the research is to explore the joint effect of the factors.
Factorial designs produce efficient experiments: each observation supplies information about all of the factors (all possible combinations).

$2^2$ Factorial Design: Two factors, each at two levels (the case k = 2 of k factors, each at two levels).
Example: Workstation Design.
Factor 1: Memory size
Factor 2: Cache size
Dependent variable: Performance.
Cache size | Memory size: 4M byte | Memory size: 4M byte
1K | 15 | 45
2K | 25 | 75
Combination and Interaction in a $2^2$ Experiment: [figure: interaction in a $2^2$ experiment]

$2^k$ Factorial Design: k factors, each at two levels.
Example: $2^3$ design: In designing a personal workstation, the three factors to be studied are: Cache size, Memory size and Number of processors.
Factors | Level 1 | Level 2
Cache size | 1K | 2K
Memory size | 154Mb | 458Mb
No. of processors | 1 | 2
Combination and Interaction in a $2^3$ Experiment: [figure: interaction in a $2^3$ experiment]

SIMPLE LINEAR REGRESSION.

Correlation vs. Regression:
A scatter diagram can be used to show the relationship between two variables.
Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
Correlation is only concerned with the strength of the relationship; No causal effect is implied by correlation.
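The strength of a linear association is usually summarized by the sample correlation coefficient r. The sketch below assumes the usual Pearson formula, which these notes do not state explicitly, and applies it to the house-price sample that appears in the regression example of these notes. The function name `pearson_r` and the variable names are illustrative assumptions.

```python
from math import sqrt

# House-price sample used in the regression example of these notes.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price_1000s = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]


def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ssx = sum((x - x_bar) ** 2 for x in xs)
    ssy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(ssx * ssy)


r = pearson_r(square_feet, price_1000s)
print(f"r = {r:.3f}")
```

The value of r always lies between -1 and 1; it measures only the strength and direction of the linear association and says nothing about cause and effect.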
INTRODUCTION TO REGRESSION ANALYSIS:
Regression analysis is used to:
Predict the value of a dependent variable based on the value of at least one independent variable;
Explain the impact of changes in an independent variable on the dependent variable.
Dependent variable: The variable we wish to predict or explain.
Independent variable: The variable used to explain the dependent variable.

SIMPLE LINEAR REGRESSION MODEL:
Only one independent variable, X;
Relationship between X and Y is described by a linear function;
Changes in Y are assumed to be caused by changes in X.
Types of Relationships: [figure]

L7-Linear Regression Lecture Notes by Dr. Mahabi Compiled by Ibrahim Nyirenda, 2022-1

Simple Linear Regression Model:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Simple Linear Regression Equation (Prediction Line): The simple linear regression equation provides an estimate of the population regression line:
$\hat{Y}_i = b_0 + b_1 X_i$
The individual random error terms $e_i$ have a mean of zero.

Least Squares Method: $b_0$ and $b_1$ are obtained by finding the values that minimize the sum of the squared differences between $Y$ and $\hat{Y}$:
$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum (Y_i - (b_0 + b_1 X_i))^2$
Model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Estimates: $\hat{Y}_i = b_0 + b_1 X_i$
Deviation: $e_i = Y_i - \hat{Y}_i$
SSE: $\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$
Formulas for the Least Squares Estimates:
$b_1 = \dfrac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}$, $b_0 = \bar{y} - b_1 \bar{x}$

Interpretation of the Slope and the Intercept:
$b_0$ is the estimated average value of Y when the value of X is zero;
$b_1$ is the estimated change in the average value of Y as a result of a one-unit change in X.

Example: Simple Linear Regression.
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet). A random sample of 10 houses is selected:
Dependent variable (Y) = House price in $1000s
Independent variable (X) = Square feet

Sample Data for House Price Model:
S/N | House Price in $1000s (Y) | Square Feet (X) | $X_i Y_i$ | $X_i^2$
01 | 245 | 1400 | 343000 | 1960000
02 | 312 | 1600 | 499200 | 2560000
03 | 279 | 1700 | 474300 | 2890000
04 | 308 | 1875 | 577500 | 3515625
05 | 199 | 1100 | 218900 | 1210000
06 | 219 | 1550 | 339450 | 2402500
07 | 405 | 2350 | 951750 | 5522500
08 | 324 | 2450 | 793800 | 6002500
09 | 319 | 1425 | 454575 | 2030625
10 | 255 | 1700 | 433500 | 2890000
Total | 2865 | 17150 | 5085975 | 30983750

Graphical Presentation: House price model: Scatter Plot. [figure]

Least Squares Method:
Slope: $b_1 = \dfrac{5085975 - \frac{2865 \times 17150}{10}}{30983750 - \frac{17150 \times 17150}{10}} = \dfrac{172500}{1571500} = 0.10977$
Y intercept: $b_0 = \bar{y} - b_1 \bar{x} = \dfrac{2865}{10} - 0.109768 \times \dfrac{17150}{10} = 98.24833$
(The intercept 98.24833 uses the unrounded slope; rounding the slope to 0.10977 before substituting gives 98.244.)

Graphical Presentation: House price model: Scatter Plot and Regression Line. [figure]

Interpretation of the Intercept - b0:
House price = 98.24833 + 0.10977 (square feet)
Whereby: $b_0$ is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values). Here, no houses had 0 square feet, so $b_0$ = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet.

Interpretation of the Slope Coefficient - b1:
House price = 98.24833 + 0.10977 (square feet)
Whereby: $b_1$ measures the estimated change in the average value of Y as a result of a one-unit change in X. Here, $b_1$ = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000) = $109.77 for each additional square foot of size.

Predictions using Regression Analysis:
Predict the price for a house with 2000 square feet:
House price = 98.25 + 0.1098 (sq. ft.) = 98.25 + 0.1098 (2000) = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850.
Interpolation vs. Extrapolation: When using a regression model for prediction, only predict within the relevant range of data.

Measures of Variation:
Total variation is made up of two parts:
$SST = SSR + SSE$
(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)
$SST = \sum (Y_i - \bar{Y})^2$, $SSR = \sum (\hat{Y}_i - \bar{Y})^2$, $SSE = \sum (Y_i - \hat{Y}_i)^2$
Whereby:
$\bar{Y}$: Average value of the dependent variable
$Y_i$: Observed values of the dependent variable
$\hat{Y}_i$: Predicted value of Y for the given $X_i$ value
SST = Total sum of squares: Measures the variation of the $Y_i$ values around their mean $\bar{Y}$
SSR = Regression sum of squares: Explained variation attributable to the relationship between X and Y
SSE = Error sum of squares: Variation attributable to factors other than the relationship between X and Y

Coefficient of Determination - r2:
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called r-squared and is denoted as $r^2$. Therefore:
$r^2 = \dfrac{SSR}{SST} = \dfrac{\text{regression sum of squares}}{\text{total sum of squares}}$, whereby $0 \le r^2 \le 1$.
Examples of Approximate r2 Values:
$r^2 = 1$: Perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X.
$0 < r^2 < 1$: Weaker linear relationships between X and Y; Some but not all of the variation in Y is explained by variation in X.
$r^2 = 0$: No linear relationship between X and Y; The value of Y does not depend on X (none of the variation in Y is explained by variation in X).

Standard Error of Estimate:
The standard deviation of the variation of observations around the regression line is estimated by:
$S_{YX} = \sqrt{\dfrac{SSE}{n - 2}} = \sqrt{\dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}}$
Whereby:
SSE: Error sum of squares.
n: Sample size.

Comparing Standard Errors: $S_{YX}$ is a measure of the variation of observed Y values from the regression line.
The magnitude of $S_{YX}$ should always be judged relative to the size of the Y values in the sample data, e.g. $S_{YX}$ = $41.33K is moderately small relative to house prices in the $200K to $300K range.

Assumptions of Regression: Use the acronym LINE:
Linearity: The underlying relationship between X and Y is linear.
Independence of Errors: Error values are statistically independent.
Normality of Error: Error values (ε) are normally distributed for any given value of X.
Equal Variance (Homoscedasticity): The probability distribution of the errors has constant variance.

Residual Analysis:
The residual for observation i, $e_i$, is the difference between its observed and predicted value:
$e_i = Y_i - \hat{Y}_i$
Check the assumptions of regression by examining the residuals:
Examine for linearity assumption;
Evaluate independence assumption;
Evaluate normal distribution assumption;
Examine for constant variance for all levels of X (Homoscedasticity).
Graphical Analysis of Residuals: Can plot residuals vs. X.

Residual Analysis for Linearity: Do the data points follow a roughly linear pattern, or are they curved or skewed in some way?
Residual Analysis for Independence: Is there any pattern in the residuals? If yes, the errors are correlated and the independence assumption is violated.
Residual Analysis for Normality: A normal probability plot of the residuals can be used to check for normality: Do the residual points fall more or less on a straight line in the normal probability plot?
Residual Analysis for Equal Variance: Does the spread of the residuals change (high/low) across the values of X? If yes, there is heteroscedasticity. If the residuals are distributed evenly and consistently around the x-axis, there is homoscedasticity.
Residual Output: The house price model does not appear to violate any regression assumptions.
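The residuals discussed above, together with the least squares estimates, the sums of squares, r-squared and the standard error of the estimate, can be reproduced directly from the house-price sample. This is a minimal sketch in plain Python; the function name `fit_line` and the variable names are illustrative assumptions, and the residual plots themselves are left to whichever plotting tool is at hand.

```python
from math import sqrt

# House-price sample from the notes: X = square feet, Y = price in $1000s.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price_1000s = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]


def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1 for Y = b0 + b1 * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / ssx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(square_feet, price_1000s)
fitted = [b0 + b1 * x for x in square_feet]
residuals = [y - y_hat for y, y_hat in zip(price_1000s, fitted)]

n = len(price_1000s)
y_bar = sum(price_1000s) / n
sst = sum((y - y_bar) ** 2 for y in price_1000s)  # total variation
sse = sum(e ** 2 for e in residuals)              # unexplained variation
ssr = sst - sse                                   # explained variation
r_squared = ssr / sst
s_yx = sqrt(sse / (n - 2))                        # standard error of the estimate

print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"r^2 = {r_squared:.4f}, S_YX = {s_yx:.2f}")

# Residuals listed in order of X, ready to be plotted against X.
for x, e in sorted(zip(square_feet, residuals)):
    print(f"{x:5d} {e:8.2f}")
```

Because the line is fitted with an intercept, the residuals sum to zero up to rounding; any systematic curve or funnel shape in the listed residuals would point to a violated assumption.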
Inferences About the Slope:
The standard error of the regression slope coefficient ($b_1$) is estimated by:
$S_{b_1} = \dfrac{S_{YX}}{\sqrt{SSX}} = \dfrac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$
Whereby:
$S_{b_1}$: Estimate of the standard error of the least squares slope
$S_{YX} = \sqrt{\dfrac{SSE}{n - 2}}$: Standard error of the estimate
Comparing Standard Errors of the Slope: $S_{b_1}$ is a measure of the variation in the slope of regression lines from different possible samples.

Inference about the Slope - t Test:
t test for a population slope: Is there a linear relationship between X and Y?
Null and alternative hypotheses:
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
Test statistic:
$t = \dfrac{b_1 - \beta_1}{S_{b_1}}$; d.f. = n - 2
Whereby:
$b_1$: Regression slope coefficient
$\beta_1$: Hypothesized slope
$S_{b_1}$: Standard error of the slope

Inferences about the Slope - t Test example:
H0: β1 = 0
H1: β1 ≠ 0
$t = \dfrac{b_1 - \beta_1}{S_{b_1}} = \dfrac{0.10977 - 0}{0.03297} = 3.32938$
Test Statistic: t = 3.329
Decision: Reject H0, since 3.329 exceeds the two-tailed critical value 2.3060 for α = .05 with 8 degrees of freedom.
Conclusion: There is sufficient evidence that square footage affects house price.
From output: P-value = 0.01039
Decision: P-value < α, so Reject H0.
Conclusion: There is sufficient evidence that square footage affects house price.

F Test for Significance:
F Test statistic:
$F = \dfrac{MSR}{MSE}$; $MSR = \dfrac{SSR}{k}$ and $MSE = \dfrac{SSE}{n - k - 1}$
Whereby: F follows an F distribution with k numerator and (n - k - 1) denominator degrees of freedom (k = the number of independent variables in the regression model).

Confidence Interval Estimate for the Slope:
$b_1 \pm t_{n-2} S_{b_1}$; d.f. = n - 2
At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).
Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size. This 95% confidence interval does not include 0.
Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance.

t Test for a Correlation Coefficient:
Hypotheses:
H0: ρ = 0 (no correlation between X and Y)
H1: ρ ≠ 0 (correlation exists)
t Test statistic:
$t = \dfrac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$ with (n - 2) degrees of freedom.
Whereby:
$r = +\sqrt{r^2}$ if $b_1 > 0$
$r = -\sqrt{r^2}$ if $b_1 < 0$

Example - House Prices: Is there evidence of a linear relationship between square feet and house price at the .05 level of significance?
H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
α = .05, df = 10 - 2 = 8
$t = \dfrac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \dfrac{.762 - 0}{\sqrt{\dfrac{1 - .762^2}{10 - 2}}} = 3.329$
This is the same value as the t statistic for the slope, so the decision is again to reject H0.

Estimating Mean Values and Predicting Individual Values:
Goal: Form intervals around Y to express uncertainty about the value of Y for a given $X_i$.

Confidence Interval for the Average Y, Given X:
Confidence interval estimate for the mean value of Y given a particular $X_i$:
Confidence interval for $\mu_{Y|X=X_i}$: $\hat{Y} \pm t_{n-2} S_{YX} \sqrt{h_i}$
$h_i = \dfrac{1}{n} + \dfrac{(X_i - \bar{X})^2}{SSX}$, with $SSX = \sum (X - \bar{X})^2$ summed over the sample.
Whereby: The size of the interval varies with the distance of $X_i$ from the mean $\bar{X}$, through the term $(X_i - \bar{X})^2$.

Prediction Interval for an Individual Y, Given X:
Interval estimate for an individual value of Y given a particular $X_i$:
$\hat{Y} \pm t_{n-2} S_{YX} \sqrt{1 + h_i}$
The extra 1 under the square root reflects the variation of an individual observation around the mean, so the prediction interval is always wider than the confidence interval for the mean.

Example - Estimation of Mean Values: Confidence Interval Estimate for $\mu_{Y|X=X_i}$.
Find the 95% confidence interval for the mean price of 2,000 square-foot houses.
Predicted Price: $\hat{Y}_i$ = 317.85 ($1,000s) with the rounded coefficients, or 317.78 with the unrounded coefficients.
$\hat{Y} \pm t_{n-2} S_{YX} \sqrt{\dfrac{1}{n} + \dfrac{(X_i - \bar{X})^2}{SSX}} = 317.78 \pm 37.12$
The confidence interval endpoints are 280.66 and 354.90, or from $280,660 to $354,900.
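The slope test, the confidence interval for the slope and the interval estimates at 2,000 square feet can be checked numerically. This is a minimal sketch under stated assumptions: the critical value `T_CRIT = 2.3060` is hardcoded from a t table (two-sided 95%, 8 degrees of freedom) rather than computed, the helper `intervals_at` is a hypothetical name, and the prediction uses the unrounded coefficients.

```python
from math import sqrt

# House-price sample from the notes: X = square feet, Y = price in $1000s.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price_1000s = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

T_CRIT = 2.3060  # assumed table value: two-sided 95% critical t, n - 2 = 8 d.f.

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price_1000s) / n
ssx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price_1000s))
b1 = sxy / ssx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, price_1000s))
s_yx = sqrt(sse / (n - 2))

# t test for H0: beta1 = 0 and the confidence interval for the slope.
s_b1 = s_yx / sqrt(ssx)
t_stat = (b1 - 0) / s_b1
slope_ci = (b1 - T_CRIT * s_b1, b1 + T_CRIT * s_b1)


def intervals_at(x_new):
    """Prediction and half-widths of the mean and individual intervals at x_new."""
    y_hat = b0 + b1 * x_new
    h = 1 / n + (x_new - x_bar) ** 2 / ssx
    mean_half = T_CRIT * s_yx * sqrt(h)       # confidence interval for the mean of Y
    pred_half = T_CRIT * s_yx * sqrt(1 + h)   # prediction interval for one Y
    return y_hat, mean_half, pred_half


y_hat, mean_half, pred_half = intervals_at(2000)
print(f"t = {t_stat:.3f}, slope CI = ({slope_ci[0]:.4f}, {slope_ci[1]:.4f})")
print(f"mean CI: {y_hat - mean_half:.2f} to {y_hat + mean_half:.2f}")
print(f"prediction interval: {y_hat - pred_half:.2f} to {y_hat + pred_half:.2f}")
```

Both intervals are centred on the same prediction; only the term under the square root differs, which is why the interval for an individual house is much wider than the interval for the mean price.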
Pitfalls of Regression Analysis:
Lacking an awareness of the assumptions underlying least-squares regression;
Not knowing how to evaluate the assumptions;
Not knowing the alternatives to least-squares regression if a particular assumption is violated;
Using a regression model without knowledge of the subject matter;
Extrapolating outside the relevant range.

Strategies for Avoiding the Pitfalls of Regression:
Start with a scatter diagram of X vs. Y to observe a possible relationship;
Perform residual analysis to check the assumptions:
Plot the residuals vs. X to check for violations of assumptions such as homoscedasticity;
Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to uncover possible non-normality;
If there is a violation of any assumption, use alternative methods or models;
If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals;
Avoid making predictions or forecasts outside the relevant range.