GM – 03 QUANTITATIVE TECHNIQUES FOR MANAGERS

WHAT DOES STATISTICS ACHIEVE?

Data, Information, Knowledge
1. Data: specific observations of measured numbers.
2. Information: processed and summarized data yielding facts and ideas.
3. Knowledge: selected and organized information that provides understanding, recommendations, and the basis for decisions.

BRANCHES OF STATISTICS
Descriptive Statistics include graphical and numerical procedures that summarize and process data; they are used to transform data into information.
Inferential Statistics provide the bases for predictions, forecasts, and estimates; they are used to transform information into knowledge.

The Journey to Making Decisions
Begin by identifying the problem and collecting data. Descriptive statistics, probability, and computers turn data into information; experience, theory, the literature, inferential statistics, and computers turn information into knowledge; knowledge supports the decision.

Describing Data
Data are summarized and described with tables and graphs, and with numerical measures.

Frequency Distributions
A frequency distribution is a table used to organize data. The left column (called classes or groups) lists numerical intervals on the variable being studied. The right column lists the frequencies, or number of observations, for each class. Intervals are normally of equal size, must cover the range of the sample observations, and must be non-overlapping.

Example: A Frequency Distribution for the Shampoo Example

  Weights (in mL)         Number of Bottles
  220 to less than 225           1
  225 to less than 230           4
  230 to less than 235          29
  235 to less than 240          34
  240 to less than 245          26
  245 to less than 250           6

Cumulative Frequency Distributions
A cumulative frequency distribution contains the number of observations whose values are less than the upper limit of each interval.
It is constructed by adding the frequencies of all frequency-distribution intervals up to and including the present interval.

Relative Cumulative Frequency Distributions
A relative cumulative frequency distribution converts all cumulative frequencies to cumulative percentages.

Example: A Cumulative Frequency Distribution for the Shampoo Example

  Weights (in mL)    Number of Bottles
  less than 225             1
  less than 230             5
  less than 235            34
  less than 240            68
  less than 245            94
  less than 250           100

Parameters and Statistics
A statistic is a descriptive measure computed from a sample of data. A parameter is a descriptive measure computed from an entire population of data.

Measures of Central Tendency - Arithmetic Mean
The arithmetic mean of a set of data is the sum of the data values divided by the number of observations.
If the data set is from a sample, the sample mean is
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$
If the data set is from a population, the population mean is
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$

Measures of Central Tendency - Median
An ordered array is an arrangement of data in either ascending or descending order. Once the data are arranged in ascending order, the median is the value such that 50% of the observations are smaller and 50% of the observations are larger. If the sample size n is odd, the median, Xm, is the middle observation; if n is even, the median is the average of the two middle observations. The median is located in the 0.50(n + 1)th ordered position.

Measures of Central Tendency - Mode
The mode, if one exists, is the most frequently occurring observation in the sample or population.

Shape of the Distribution
The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the mean. In a symmetric distribution the mean and median are equal.
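The three measures of central tendency above can be checked with Python's standard library. This is a minimal sketch; the sample values here are illustrative and are not taken from the shampoo example.

```python
import statistics

# Illustrative sample (not from the text's data)
sample = [3, 5, 5, 6, 7, 8, 9]

mean = statistics.mean(sample)      # sum of the values divided by n
median = statistics.median(sample)  # middle value of the ordered array (n is odd here)
mode = statistics.mode(sample)      # most frequently occurring value

print(mean, median, mode)  # 6.142857142857143 6 5
```

With n = 7 the median sits in the 0.50(7 + 1) = 4th ordered position, matching the rule stated above.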
Shape of the Distribution
A distribution is skewed if the observations are not symmetrically distributed above and below the mean. A positively skewed (or skewed-to-the-right) distribution has a tail that extends to the right, in the direction of positive values. A negatively skewed (or skewed-to-the-left) distribution has a tail that extends to the left, in the direction of negative values.

[Figure: histograms of a symmetric, a positively skewed, and a negatively skewed distribution.]

Measures of Variability (Measures of Dispersion) - The Range
The range in a set of data is the difference between the largest and smallest observations.

Measures of Variability - Sample Variance (the most important measure of dispersion)
The sample variance, s², is the sum of the squared differences between each observation and the sample mean, divided by the sample size minus 1:
$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}$

Short-cut formulas for the sample variance are:
$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}$  or  $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}{n-1}$

Measures of Variability - Population Variance (distinguish between population variance and sample variance; this is important)
The population variance, σ², is the sum of the squared differences between each observation and the population mean, divided by the population size, N:
$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$

Measures of Variability - Sample Standard Deviation
The sample standard deviation, s, is the positive square root of the sample variance:
$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}}$

Measures of Variability - Population Standard Deviation
The population standard deviation, σ, is:
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$

The Empirical Rule
For a set of data with a bell-shaped histogram:
• approximately 68% of the observations lie within one standard deviation of the mean (μ ± 1σ);
• approximately 95% of the observations lie within two standard deviations of the mean (μ ± 2σ);
• almost all of the observations lie within three standard deviations of the mean (μ ± 3σ).

Coefficient of Variation
The coefficient of variation, CV, is a measure of relative dispersion that expresses the standard deviation as a percentage of the mean (provided the mean is positive).
Sample: $CV = \frac{s}{\bar{X}} \times 100\%$ if $\bar{X} > 0$
Population: $CV = \frac{\sigma}{\mu} \times 100\%$ if $\mu > 0$

Five-Number Summary
The five-number summary refers to the five descriptive measures: minimum, first quartile, median, third quartile, and maximum.
$X_{\min} \le Q_1 \le Median \le Q_3 \le X_{\max}$

Grouped Data Mean
Where the data set contains observation values m₁, m₂, ..., m_K occurring with frequencies f₁, f₂, ..., f_K respectively:
For a population of N observations the mean is $\mu = \frac{\sum_{i=1}^{K} f_i m_i}{N}$
For a sample of n observations the mean is $\bar{X} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$

Grouped Data Variance
For a population of N observations the variance is
$\sigma^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N} = \frac{\sum_{i=1}^{K} f_i m_i^2}{N} - \mu^2$
For a sample of n observations the variance is
$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{X})^2}{n-1} = \frac{\sum_{i=1}^{K} f_i m_i^2 - n\bar{X}^2}{n-1}$
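The equivalence of the sample-variance definition and its short-cut formula can be verified numerically. A small sketch, with illustrative data values not taken from the text:

```python
from math import sqrt

# Illustrative sample, chosen so the mean comes out exactly 9
x = [4.0, 7.0, 9.0, 10.0, 15.0]
n = len(x)
xbar = sum(x) / n

# Definition: sum of squared deviations from the mean, divided by n - 1
s2_def = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

# Short-cut formula: (sum of squares - n * mean^2) / (n - 1)
s2_short = (sum(xi ** 2 for xi in x) - n * xbar ** 2) / (n - 1)

s = sqrt(s2_def)       # sample standard deviation
cv = s / xbar * 100    # coefficient of variation, in percent

print(s2_def, s2_short, round(cv, 2))  # 16.5 16.5 45.13
```

Both formulas give the same variance, as the algebra above guarantees.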
1-8 Methods of Displaying Data
• Pie charts: categories represented as percentages of the total.
• Bar graphs: heights of rectangles represent group frequencies.
• Frequency polygons: the height of the line represents the frequency.
• Ogives: the height of the line represents the cumulative frequency.

[Figure 1-10: pie chart, "Twentysomethings split on job satisfaction," with five categories (Do not like my job, but it is on my career path; Job is OK, but it is not on my career path; Enjoy job, but it is not on my career path; My job just pays the bills; Happy with career) shown as percentages of the total: 6.0%, 19.0%, 19.0%, 23.0%, and 33.0%.]

[Figure 1-11: bar chart, "Shifting Gears": quarterly net income for General Motors (in billions), 1Q 2003 through 1Q 2004.]

[Figure: relative frequency polygon and ogive for the sales data; an ogive is a cumulative (relative) frequency graph.]

[Figure: time plot of monthly steel production, in millions of tons.]

Example 1-8: Stem-and-Leaf Display (Figure 1-17: Task Performance Times)
  1 | 122355567
  2 | 0111222346777899
  3 | 012457
  4 | 11257
  5 | 0236
  6 | 02

Box Plot
Elements of a box plot: the box spans the quartiles Q1 and Q3, with the median marked inside; the interquartile range is IQR = Q3 − Q1. The inner fences lie at Q1 − 1.5(IQR) and Q3 + 1.5(IQR); the outer fences lie at Q1 − 3(IQR) and Q3 + 3(IQR). Whiskers extend to the smallest data point not below the inner fence and to the largest data point not exceeding the inner fence. Points between the inner and outer fences are suspected outliers; points beyond the outer fences are outliers.

Probability

2 Probability: Using Statistics
• Basic definitions: events, sample space, and probabilities
• Basic rules for probability
• Conditional probability
• Independence of events
• Combinatorial concepts
• The law of total probability and Bayes' theorem
• The joint probability table
• Using the computer

2-1 What Is Probability?
Probability is a quantitative measure
of uncertainty, a measure of the strength of belief in the occurrence of an uncertain event, and a measure of the degree of chance or likelihood of occurrence of an uncertain event. It is measured by a number between 0 and 1 (or between 0% and 100%).

Types of Probability
Objective or Classical Probability
• based on equally-likely events or on the long-run relative frequency of events
• not based on personal beliefs
• the same for all observers (objective)
• examples: toss a coin, throw a die, pick a card

Subjective Probability
• based on personal beliefs, experiences, prejudices, and intuition (personal judgment)
• different for different observers (subjective)
• examples: the Super Bowl, elections, a new product introduction, snowfall

2-2 Basic Definitions
Set: a collection of elements or objects of interest.
Empty set (denoted by ∅): a set containing no elements.
Universal set (denoted by S): a set containing all possible elements.
Complement (Not): the complement Ā of A is a set containing all elements of S not in A.

[Venn diagram illustrating the complement of an event.]

Intersection (And): A ∩ B is a set containing all elements in both A and B.
Union (Or): A ∪ B is a set containing all elements in A or B or both.

[Venn diagrams: A ∩ B and A ∪ B.]

Mutually exclusive or disjoint sets: sets having no elements in common, that is, whose intersection is the empty set.
Partition: a collection of mutually exclusive sets which together include all possible elements, that is, whose union is the universal set.

[Venn diagram: mutually exclusive (disjoint) sets.]

Experiment
A process that leads to one of several possible outcomes (also called a basic outcome, elementary event, or simple event), e.g.:
• Coin toss: Heads, Tails
• Throw a die: 1, 2, 3, 4, 5, 6
• Pick a card: AH, KH, QH, ...
• Introduce a new product
Each trial of an experiment has a single observed outcome. The precise outcome of a random experiment is unknown before a trial.

Events: Definitions
Sample space (or event set): the set of all possible outcomes (the universal set) for a given experiment. E.g., roll a regular six-sided die: S = {1, 2, 3, 4, 5, 6}.
Event: a collection of outcomes having a common characteristic. E.g., an even number: A = {2, 4, 6}. Event A occurs if an outcome in the set A occurs.
Probability of an event: the sum of the probabilities of the outcomes of which it consists: P(A) = P(2) + P(4) + P(6).

Equally-Likely Probabilities (Hypothetical or Ideal Experiments)
For example, throw a die. There are six possible outcomes {1, 2, 3, 4, 5, 6}; if each is equally likely, the probability of each is 1/6 = 0.1667 = 16.67%. The probability of each equally-likely outcome e is 1 divided by the number of possible outcomes:
$P(e) = \frac{1}{n(S)}$
For event A (an even number):
$P(A) = \sum_{e \in A} P(e) = \frac{n(A)}{n(S)} = \frac{3}{6} = \frac{1}{2}$
so P(A) = P(2) + P(4) + P(6) = 1/6 + 1/6 + 1/6 = 1/2.

Pick a Card: Sample Space
The 52-card deck contains the 13 ranks A, K, Q, J, 10, ..., 2 in each of the four suits: hearts, diamonds, clubs, and spades.
Event 'Heart': $P(Heart) = \frac{n(Heart)}{n(S)} = \frac{13}{52} = \frac{1}{4}$
Event 'Ace': $P(Ace) = \frac{n(Ace)}{n(S)} = \frac{4}{52} = \frac{1}{13}$
Union of the events 'Heart' and 'Ace': $P(Heart \cup Ace) = \frac{n(Heart \cup Ace)}{n(S)} = \frac{16}{52}$
The intersection of the events 'Heart' and 'Ace' comprises a single point, the ace of hearts:
$P(Heart \cap Ace) = \frac{n(Heart \cap Ace)}{n(S)} = \frac{1}{52}$

2-3 Basic Rules for Probability
Range of values: $0 \le P(A) \le 1$
Complement (probability of not A): $P(\bar{A}) = 1 - P(A)$
Intersection (probability of both A and B): $P(A \cap B) = \frac{n(A \cap B)}{n(S)}$
Mutually exclusive events (A and C): $P(A \cap C) = 0$
Union (probability of A or B or both; the rule of unions): $P(A \cup B) = \frac{n(A \cup B)}{n(S)} = P(A) + P(B) - P(A \cap B)$
Mutually exclusive events: if A and B are mutually exclusive, then $P(A \cap B) = 0$, so $P(A \cup B) = P(A) + P(B)$.
[Venn diagram: P(A ∪ B).]

2-4 Conditional Probability
The conditional probability of A given B is
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, where $P(B) > 0$
For independent events: $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$.

Rules of conditional probability: since $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, it follows that
$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
If events A and D are statistically independent:
$P(A \mid D) = P(A)$ and $P(D \mid A) = P(D)$, so $P(A \cap D) = P(A)\,P(D)$

Contingency Table - Example 2-2

  Counts              AT&T   IBM   Total
  Telecommunications    40    10      50
  Computers             20    30      50
  Total                 60    40     100

  Probabilities       AT&T   IBM   Total
  Telecommunications   .40   .10     .50
  Computers            .20   .30     .50
  Total                .60   .40    1.00

The probability that a project is undertaken by IBM, given that it is a telecommunications project:
$P(IBM \mid T) = \frac{P(IBM \cap T)}{P(T)} = \frac{0.10}{0.50} = 0.2$

2-5 Independence of Events
Conditions for the statistical independence of events A and B:
$P(A \mid B) = P(A)$, $P(B \mid A) = P(B)$, and $P(A \cap B) = P(A)\,P(B)$
Card example:
$P(Ace \mid Heart) = \frac{P(Ace \cap Heart)}{P(Heart)} = \frac{1/52}{13/52} = \frac{1}{13} = P(Ace)$
$P(Heart \mid Ace) = \frac{P(Heart \cap Ace)}{P(Ace)} = \frac{1/52}{4/52} = \frac{1}{4} = P(Heart)$
$P(Ace \cap Heart) = \frac{1}{52} = \frac{4}{52} \times \frac{13}{52} = P(Ace)\,P(Heart)$

Independence of Events - Example 2-5
The events Television (T) and Billboard (B) are assumed to be independent, with P(T) = 0.04 and P(B) = 0.06.
a) $P(T \cap B) = P(T)\,P(B) = 0.04 \times 0.06 = 0.0024$
b) $P(T \cup B) = P(T) + P(B) - P(T \cap B) = 0.04 + 0.06 - 0.0024 = 0.0976$

Product Rules for Independent Events
The probability of the intersection of several independent events is the product of their separate individual probabilities:
$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2)\cdots P(A_n)$
The probability of the union of several independent events is 1 minus the product of the probabilities of their complements:
$P(A_1 \cup A_2 \cup \cdots \cup A_n) = 1 - P(\bar{A}_1)\,P(\bar{A}_2)\cdots P(\bar{A}_n)$
Example 2-7:
$P(Q_1 \cup Q_2 \cup \cdots \cup Q_{10}) = 1 - P(\bar{Q}_1)\,P(\bar{Q}_2)\cdots P(\bar{Q}_{10}) = 1 - 0.90^{10} = 1 - 0.3487 = 0.6513$

2-6 Combinatorial Concepts
Consider a pair of six-sided dice.
There are six possible outcomes from throwing the first die {1, 2, 3, 4, 5, 6} and six possible outcomes from throwing the second die {1, 2, 3, 4, 5, 6}. Altogether, there are 6 × 6 = 36 possible outcomes from throwing the two dice. In general, if there are n events and event i can happen in Nᵢ possible ways, then the number of ways in which the sequence of n events may occur is N₁N₂...Nₙ.
• Pick 5 cards from a deck of 52, with replacement: 52 × 52 × 52 × 52 × 52 = 52⁵ = 380,204,032 different possible outcomes.
• Pick 5 cards from a deck of 52, without replacement: 52 × 51 × 50 × 49 × 48 = 311,875,200 different possible outcomes.

Factorial
How many ways can you order the 3 letters A, B, and C? There are 3 choices for the first letter, 2 for the second, and 1 for the last, so there are 3 × 2 × 1 = 6 possible ways to order the three letters. How many ways are there to order the 6 letters A, B, C, D, E, and F? (6 × 5 × 4 × 3 × 2 × 1 = 720.)
Factorial: for any positive integer n, we define n factorial as n(n−1)(n−2)...(1), denoted n!. The number n! is the number of ways in which n objects can be ordered. By definition, 1! = 1 and 0! = 1.

Permutations (Order Is Important)
What if we choose only 3 out of the 6 letters A, B, C, D, E, and F? There are 6 ways to choose the first letter, 5 ways to choose the second, and 4 ways to choose the third (leaving 3 letters unchosen). That makes 6 × 5 × 4 = 120 possible orderings, or permutations.
Permutations are the possible ordered selections of r objects out of a total of n objects. The number of permutations of n objects taken r at a time is denoted nPr, where
$_nP_r = \frac{n!}{(n-r)!}$
For example:
$_6P_3 = \frac{6!}{(6-3)!} = \frac{6!}{3!} = \frac{6 \times 5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = 6 \times 5 \times 4 = 120$

Combinations (Order Is Not Important)
Suppose that when we pick 3 letters out of the 6 letters A, B, C, D, E, and F we choose BCD, or BDC, or CBD, or CDB, or DBC, or DCB. (These are the 6 = 3! permutations, or orderings, of the 3 letters B, C, and D.)
But these are orderings of the same combination of 3 letters. How many combinations of 6 different letters, taking 3 at a time, are there?
Combinations are the possible selections of r items from a group of n items regardless of the order of selection. The number of combinations is denoted $\binom{n}{r}$ and is read as "n choose r." An alternative notation is nCr. We define the number of combinations of r out of n elements as:
$\binom{n}{r} = {_nC_r} = \frac{n!}{r!\,(n-r)!}$
For example:
$\binom{6}{3} = \frac{6!}{3!\,(6-3)!} = \frac{6 \times 5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1)(3 \times 2 \times 1)} = \frac{6 \times 5 \times 4}{3 \times 2 \times 1} = \frac{120}{6} = 20$

2-7 The Law of Total Probability and Bayes' Theorem
The law of total probability:
$P(A) = P(A \cap B) + P(A \cap \bar{B})$
In terms of conditional probabilities:
$P(A) = P(A \mid B)\,P(B) + P(A \mid \bar{B})\,P(\bar{B})$
More generally, where the events Bᵢ make up a partition:
$P(A) = \sum_i P(A \cap B_i) = \sum_i P(A \mid B_i)\,P(B_i)$

Bayes' Theorem
Bayes' theorem enables you, knowing just a little more than the probability of A given B, to find the probability of B given A. It is based on the definition of conditional probability and the law of total probability:
$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A \cap B)}{P(A \cap B) + P(A \cap \bar{B})} = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid \bar{B})\,P(\bar{B})}$
(The definition of conditional probability is applied throughout, and the law of total probability is applied to the denominator.)

2-8 The Joint Probability Table
A joint probability table is similar to a contingency table, except that it has probabilities in place of frequencies. The row totals and column totals are called marginal probabilities. The joint probability table for Example 2-11 is summarized below.
                          GROWTH
                    High   Medium   Low    Total
  (Re) Appreciates  0.21    0.20    0.04    0.45
  (Re) Depreciates  0.09    0.30    0.16    0.55
  Total             0.30    0.50    0.20    1.00

Marginal probabilities are the row totals and the column totals.

Random Variables

3-1 Using Statistics
Consider the different possible orderings of boy (B) and girl (G) in four sequential births. There are 2 × 2 × 2 × 2 = 2⁴ = 16 possibilities, so the sample space is:
BBBB BBBG BBGB BBGG BGBB BGBG BGGB BGGG GBBB GBBG GBGB GBGG GGBB GGBG GGGB GGGG
If girl and boy are each equally likely [P(G) = P(B) = 1/2], and the gender of each child is independent of that of the previous child, then the probability of each of these 16 possibilities is (1/2)(1/2)(1/2)(1/2) = 1/16.

Random Variables (Continued)
Let the random variable X count the number of girls. X maps each sample-space point to a point on the real line:
  X = 0: BBBB
  X = 1: BBBG BBGB BGBB GBBB
  X = 2: BBGG BGBG BGGB GBBG GBGB GGBB
  X = 3: BGGG GBGG GGBG GGGB
  X = 4: GGGG
Since the random variable X = 3 when any of the four outcomes BGGG, GBGG, GGBG, or GGGB occurs,
P(X = 3) = P(BGGG) + P(GBGG) + P(GGBG) + P(GGGB) = 4/16.
The probability distribution of a random variable is a table that lists the possible values of the random variable and their associated probabilities:

  x     P(x)
  0     1/16
  1     4/16
  2     6/16
  3     4/16
  4     1/16
        16/16 = 1

[Figure: probability distribution of the number of girls in four births.]

Example 3-1
Consider the experiment of tossing two six-sided dice. There are 36 possible outcomes.
Let the random variable X represent the sum of the numbers on the two dice. The 36 equally-likely outcomes (first die, second die) give the following probability distribution:

  x      P(x)
  2      1/36
  3      2/36
  4      3/36
  5      4/36
  6      5/36
  7      6/36
  8      5/36
  9      4/36
  10     3/36
  11     2/36
  12     1/36
         1

Note that P(x) = (6 − |7 − x|)/36.

[Figure: probability distribution of the sum of two dice.]

Example 3-2
The probability distribution of the number of switches:

  x     P(x)
  0     0.1
  1     0.2
  2     0.3
  3     0.2
  4     0.1
  5     0.1
        1

Probability of more than 2 switches: P(X > 2) = P(3) + P(4) + P(5) = 0.2 + 0.1 + 0.1 = 0.4
Probability of at least 1 switch: P(X ≥ 1) = 1 − P(0) = 1 − 0.1 = 0.9

Discrete and Continuous Random Variables
A discrete random variable:
• has a countable number of possible values
• has discrete jumps (or gaps) between successive values
• has measurable probability associated with individual values
• counts
A continuous random variable:
• has an uncountably infinite number of possible values
• moves continuously from value to value
• has no measurable probability associated with any individual value
• measures (e.g., height, weight, speed, value, duration, length)

Rules of Discrete Probability Distributions
The probability distribution of a discrete random variable X must satisfy the following two conditions:
1. P(x) ≥ 0 for all values of x.
2.
$\sum_{all\ x} P(x) = 1$
Corollary: $0 \le P(x) \le 1$

Cumulative Distribution Function
The cumulative distribution function, F(x), of a discrete random variable X is:
$F(x) = P(X \le x) = \sum_{i \le x} P(i)$
For the number-of-switches distribution:

  x     P(x)    F(x)
  0     0.1     0.1
  1     0.2     0.3
  2     0.3     0.6
  3     0.2     0.8
  4     0.1     0.9
  5     0.1     1.0
        1.00

[Figure: cumulative probability distribution of the number of switches.]

The probability that at most three switches will occur:
P(X ≤ 3) = F(3) = 0.8 = P(0) + P(1) + P(2) + P(3)

Using Cumulative Probability Distributions (Figure 3-8)
The probability that more than one switch will occur:
P(X > 1) = P(X ≥ 2) = 1 − P(X ≤ 1) = 1 − F(1) = 1 − 0.3 = 0.7

Using Cumulative Probability Distributions (Figure 3-9)
The probability that anywhere from one to three switches will occur:
P(1 ≤ X ≤ 3) = P(X ≤ 3) − P(X ≤ 0) = F(3) − F(0) = 0.8 − 0.1 = 0.7

3-2 Expected Values of Discrete Random Variables
The mean of a probability distribution is a measure of its centrality or location, as is the mean or average of a frequency distribution. It is a weighted average, with the values of the random variable weighted by their probabilities; for the switches distribution the mean is 2.3.
The mean is also known as the expected value (or expectation) of a random variable, because it is the value that is expected to occur, on average. The expected value of a discrete random variable X is equal to the sum of each value of the random variable multiplied by its probability.
$E(X) = \sum_{all\ x} x\,P(x)$

  x     P(x)    xP(x)
  0     0.1     0.0
  1     0.2     0.2
  2     0.3     0.6
  3     0.2     0.6
  4     0.1     0.4
  5     0.1     0.5
        1.0     2.3 = E(X)

Expected Value of a Function of a Discrete Random Variable
The expected value of a function h of a discrete random variable X is:
$E[h(X)] = \sum_{all\ x} h(x)\,P(x)$
Example 3-3: Monthly sales of a certain product are believed to follow the given probability distribution. Suppose the company has a fixed monthly production cost of $8000 and that each item brings $2. Find the expected monthly profit, h(X), from product sales. Note that h(X) = 2X − 8000, where X is the number of items sold.

  Number of items, x   P(x)   xP(x)    h(x)    h(x)P(x)
  5000                 0.2    1000      2000      400
  6000                 0.3    1800      4000     1200
  7000                 0.2    1400      6000     1200
  8000                 0.2    1600      8000     1600
  9000                 0.1     900     10000     1000
                       1.0    6700               5400

$E[h(X)] = \sum_{all\ x} h(x)\,P(x) = 5400$
The expected value of a linear function of a random variable is E(aX + b) = aE(X) + b. In this case:
E(2X − 8000) = 2E(X) − 8000 = (2)(6700) − 8000 = 5400

Variance and Standard Deviation of a Random Variable
The variance of a random variable is the expected squared deviation from the mean:
$\sigma^2 = V(X) = E[(X - \mu)^2] = \sum_{all\ x}(x - \mu)^2 P(x) = E(X^2) - [E(X)]^2 = \sum_{all\ x} x^2 P(x) - \left(\sum_{all\ x} x\,P(x)\right)^2$
The standard deviation of a random variable is the square root of its variance: $SD(X) = \sqrt{V(X)}$

Variance of the Number of Switches - Example 3-2 (Table 3-8)
Recall: μ = 2.3.
  x    P(x)   xP(x)   (x − μ)   (x − μ)²   (x − μ)²P(x)   x²P(x)
  0    0.1    0.0      −2.3      5.29        0.529          0.0
  1    0.2    0.2      −1.3      1.69        0.338          0.2
  2    0.3    0.6      −0.3      0.09        0.027          1.2
  3    0.2    0.6       0.7      0.49        0.098          1.8
  4    0.1    0.4       1.7      2.89        0.289          1.6
  5    0.1    0.5       2.7      7.29        0.729          2.5
              2.3                            2.010          7.3

$V(X) = E[(X - \mu)^2] = \sum_{all\ x}(x - \mu)^2 P(x) = 2.01$
or, equivalently,
$V(X) = E(X^2) - [E(X)]^2 = \sum_{all\ x} x^2 P(x) - \left(\sum_{all\ x} x\,P(x)\right)^2 = 7.3 - 2.3^2 = 2.01$

Variance of a Linear Function of a Random Variable
The variance of a linear function of a random variable is:
$V(aX + b) = a^2 V(X) = a^2\sigma^2$
Example 3-3 (continued):

  Number of items, x   P(x)   xP(x)    x²P(x)
  5000                 0.2    1000     5,000,000
  6000                 0.3    1800    10,800,000
  7000                 0.2    1400     9,800,000
  8000                 0.2    1600    12,800,000
  9000                 0.1     900     8,100,000
                       1.0    6700    46,500,000

$V(X) = E(X^2) - [E(X)]^2 = 46{,}500{,}000 - 6700^2 = 1{,}610{,}000$
$SD(X) = \sqrt{1{,}610{,}000} = 1268.86$
$V(2X - 8000) = (2^2)\,V(X) = (4)(1{,}610{,}000) = 6{,}440{,}000$
$SD(2X - 8000) = (2)(1268.86) = 2537.72$

Some Properties of Means and Variances of Random Variables
The mean or expected value of the sum of random variables is the sum of their means or expected values:
$\mu_{X+Y} = E(X + Y) = E(X) + E(Y) = \mu_X + \mu_Y$
For example: if E(X) = $350 and E(Y) = $200, then E(X + Y) = $350 + $200 = $550.
The variance of the sum of mutually independent random variables is the sum of their variances:
$\sigma^2_{X+Y} = V(X + Y) = V(X) + V(Y) = \sigma^2_X + \sigma^2_Y$, if and only if X and Y are independent.
For example: if V(X) = 84 and V(Y) = 60, then V(X + Y) = 144.
More generally:
$E(X_1 + X_2 + \cdots + X_k) = E(X_1) + E(X_2) + \cdots + E(X_k)$
$E(a_1X_1 + a_2X_2 + \cdots + a_kX_k) = a_1E(X_1) + a_2E(X_2) + \cdots + a_kE(X_k)$
The variance of the sum of k mutually independent random variables is the sum of their variances:
$V(X_1 + X_2 + \cdots + X_k) = V(X_1) + V(X_2) + \cdots + V(X_k)$
and
$V(a_1X_1 + a_2X_2 + \cdots + a_kX_k) = a_1^2V(X_1) + a_2^2V(X_2) + \cdots + a_k^2V(X_k)$

Chebyshev's Theorem Applied to Probability Distributions
Chebyshev's Theorem applies to probability distributions just as it applies to frequency distributions.
For a random variable X with mean μ and standard deviation σ, and for any number k > 1:
$P(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2}$
• k = 2: at least $1 - \frac{1}{4} = \frac{3}{4} = 75\%$ of the values lie within 2 standard deviations of the mean.
• k = 3: at least $1 - \frac{1}{9} = \frac{8}{9} \approx 89\%$ of the values lie within 3 standard deviations of the mean.
• k = 4: at least $1 - \frac{1}{16} = \frac{15}{16} \approx 94\%$ of the values lie within 4 standard deviations of the mean.

3-3 Bernoulli Random Variable
If an experiment consists of a single trial and the outcome of the trial can only be either a success* or a failure, then the trial is called a Bernoulli trial. The number of successes X in one Bernoulli trial, which can be 1 or 0, is a Bernoulli random variable.
Note: if p is the probability of success in a Bernoulli experiment, then E(X) = p and V(X) = p(1 − p).
* The terms success and failure are simply statistical terms and do not have positive or negative implications. In a production setting, finding a defective product may be termed a "success," although it is not a positive result.

3-4 The Binomial Random Variable
Consider a Bernoulli process in which we have a sequence of n identical trials satisfying the following conditions:
1. Each trial has two possible outcomes, called success and failure. The two outcomes are mutually exclusive and exhaustive.
2. The probability of success, denoted by p, remains constant from trial to trial. The probability of failure is denoted by q, where q = 1 − p.
3. The n trials are independent. That is, the outcome of any trial does not affect the outcomes of the other trials.
A random variable, X, that counts the number of successes in n Bernoulli trials, where p is the probability of success in any given trial, is said to follow the binomial probability distribution with parameters n (number of trials) and p (probability of success). We call X the binomial random variable.
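The Bernoulli mean and variance stated above, E(X) = p and V(X) = p(1 − p), can be verified directly from the two-point distribution. A small sketch using exact rational arithmetic; the value p = 3/10 is purely illustrative.

```python
from fractions import Fraction

# Illustrative success probability (Fraction keeps the arithmetic exact)
p = Fraction(3, 10)

# A Bernoulli random variable takes the value 1 with probability p, 0 with 1 - p
dist = {0: 1 - p, 1: p}

mean = sum(x * px for x, px in dist.items())               # E(X)
var = sum((x - mean) ** 2 * px for x, px in dist.items())  # V(X)

print(mean, var)           # 3/10 21/100
print(var == p * (1 - p))  # True: matches V(X) = p(1 - p)
```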
Binomial Probabilities (Introduction)
Suppose we toss a single fair and balanced coin five times in succession, and let X represent the number of heads. There are 2⁵ = 32 possible sequences of H and T (S and F) in the sample space for this experiment. Of these, there are 10 in which there are exactly 2 heads (X = 2):
HHTTT HTHTT HTTHT HTTTH THHTT THTHT THTTH TTHHT TTHTH TTTHH
The probability of each of these 10 outcomes is p²q³ = (1/2)²(1/2)³ = 1/32, so the probability of 2 heads in 5 tosses of a fair and balanced coin is:
P(X = 2) = 10 × (1/32) = 10/32 = 0.3125
Notice that this probability has two parts: 10 is the number of outcomes with 2 heads, and 1/32 is the probability of each outcome with 2 heads.
In general:
1. The probability of a given sequence of x successes out of n trials, with probability of success p and probability of failure q, is equal to $p^x q^{(n-x)}$.
2. The number of different sequences of n trials that result in exactly x successes is equal to the number of choices of x elements out of a total of n elements:
$\binom{n}{x} = {_nC_x} = \frac{n!}{x!\,(n-x)!}$

The Binomial Probability Distribution
The binomial probability distribution is:
$P(x) = \binom{n}{x}\,p^x q^{(n-x)} = \frac{n!}{x!\,(n-x)!}\,p^x q^{(n-x)}$
where p is the probability of success in a single trial, q = 1 − p, n is the number of trials, and x is the number of successes.

  Number of successes, x   Probability P(x)
  0                        [n! / (0!(n−0)!)] p⁰ q⁽ⁿ⁻⁰⁾
  1                        [n! / (1!(n−1)!)] p¹ q⁽ⁿ⁻¹⁾
  2                        [n! / (2!(n−2)!)] p² q⁽ⁿ⁻²⁾
  3                        [n! / (3!(n−3)!)] p³ q⁽ⁿ⁻³⁾
  ...                      ...
  n                        [n! / (n!(n−n)!)] pⁿ q⁽ⁿ⁻ⁿ⁾
                           1.00

The Normal Distribution

4-1 Introduction
As n increases, the binomial distribution approaches a normal distribution.
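The binomial formula above reproduces the coin-toss result directly. A minimal sketch using the standard library's math.comb for the binomial coefficient:

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(x) = C(n, x) * p**x * q**(n - x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Five tosses of a fair coin: probability of exactly 2 heads
print(binomial_pmf(2, 5, 0.5))  # 0.3125

# The probabilities over x = 0, ..., n sum to 1, as required of a distribution
print(sum(binomial_pmf(x, 5, 0.5) for x in range(6)))  # 1.0
```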
[Figure: binomial distributions for n = 6, n = 10, and n = 14 (each with p = .5), looking increasingly bell-shaped as n grows.]

The Normal Probability Distribution
The normal probability density function is:
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $-\infty < x < \infty$
where e = 2.7182818... and π = 3.14159265...

[Figure: normal distribution with μ = 0, σ = 1.]

4-2 Properties of the Normal Distribution
• The normal is a family of bell-shaped and symmetric distributions; because the distribution is symmetric, one-half (.50 or 50%) lies on either side of the mean.
• Each member of the family is characterized by a different pair of mean, μ, and variance, σ²; that is, X ~ N(μ, σ²).
• Each is asymptotic to the horizontal axis.
• The area under any normal probability density function within kσ of μ is the same for any normal distribution, regardless of the mean and variance.
• If several independent random variables are normally distributed, then their sum will also be normally distributed. The mean of the sum is the sum of the individual means, and the variance of the sum is the sum of the individual variances (by virtue of independence).
• If X₁, X₂, ..., Xₙ are independent normal random variables, then their sum S is also normally distributed, with
E(S) = E(X₁) + E(X₂) + ... + E(Xₙ)
V(S) = V(X₁) + V(X₂) + ... + V(Xₙ)
Note: it is the variances that can be added above, not the standard deviations.
11-106 4-2 Properties of the Normal Distribution – Example 4-1
Example 4-1: Let X1, X2, and X3 be independent random variables that are normally distributed with means and variances as shown:
X1: mean 10, variance 1; X2: mean 20, variance 2; X3: mean 30, variance 3.
Let S = X1 + X2 + X3. Then E(S) = 10 + 20 + 30 = 60 and V(S) = 1 + 2 + 3 = 6. The standard deviation of S is √6 = 2.45.

11-107 4-2 Properties of the Normal Distribution (continued)
• If X1, X2, ..., Xn are independent normal random variables, then the random variable Q defined as Q = a1X1 + a2X2 + ... + anXn + b will also be normally distributed with
• E(Q) = a1E(X1) + a2E(X2) + ... + anE(Xn) + b
• V(Q) = a1²V(X1) + a2²V(X2) + ... + an²V(Xn)
• Note: it is the variances that can be added above, not the standard deviations.

11-108 4-2 Properties of the Normal Distribution – Example 4-3
Example 4-3: Let X1, X2, X3, and X4 be independent random variables that are normally distributed with means and variances as shown. Find the mean and variance of Q = X1 − 2X2 + 3X3 − 4X4 + 5.
X1: mean 12, variance 4; X2: mean −5, variance 2; X3: mean 8, variance 5; X4: mean 10, variance 1.
E(Q) = 12 − 2(−5) + 3(8) − 4(10) + 5 = 11
V(Q) = (1)²(4) + (−2)²(2) + (3)²(5) + (−4)²(1) = 73
SD(Q) = √73 ≈ 8.544

11-109/11-110 Computing the Mean, Variance, and Standard Deviation for the Sum of Independent Random Variables Using the Template
[Template screenshots.]

11-111 Normal Probability Distributions
All of these are normal probability density functions, though each has a different mean and variance:
W ~ N(40, 1), X ~ N(30, 25), Y ~ N(50, 9), Z ~ N(0, 1).
Consider: P(39 ≤ W ≤ 41), P(25 ≤ X ≤ 35), P(47 ≤ Y ≤ 53), P(−1 ≤ Z ≤ 1).
The probability in each case is an area under a normal probability density function.
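The rules for linear combinations of independent normal variables reduce to simple sums, which can be checked numerically; a sketch using the Example 4-3 numbers:

```python
# Q = X1 - 2*X2 + 3*X3 - 4*X4 + 5, with the Example 4-3 means and variances.
coeffs = [1, -2, 3, -4]
means = [12, -5, 8, 10]
variances = [4, 2, 5, 1]
b = 5

E_Q = sum(a * m for a, m in zip(coeffs, means)) + b   # constant shifts the mean
V_Q = sum(a**2 * v for a, v in zip(coeffs, variances))  # coefficients enter squared
SD_Q = V_Q ** 0.5
print(E_Q, V_Q, round(SD_Q, 3))  # 11 73 8.544
```

Note the constant b affects only the mean, and each coefficient enters the variance squared, which is why a negative coefficient still increases the variance.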
11-112 4-4 The Standard Normal Distribution
The standard normal random variable, Z, is the normal random variable with mean µ = 0 and standard deviation σ = 1: Z ~ N(0, 1²).
[Figure: standard normal density with µ = 0 and σ = 1.]

11-113 Finding Probabilities of the Standard Normal Distribution: P(0 < Z < 1.56)
Look in the row labeled 1.5 and the column labeled .06 of the standard normal table to find P(0 ≤ Z ≤ 1.56) = 0.4406.
[Table: standard normal probabilities P(0 < Z < z) for z = 0.00 to 3.09, columns .00 through .09.]
11-114 Finding Probabilities of the Standard Normal Distribution: P(Z < −2.47)
To find P(Z < −2.47):
1. Find the table area for 2.47: P(0 < Z < 2.47) = 0.4932.
2. By symmetry, P(Z < −2.47) = .5 − P(0 < Z < 2.47) = .5 − 0.4932 = 0.0068.

11-115 Finding Probabilities of the Standard Normal Distribution: P(1 < Z < 2)
To find P(1 ≤ Z ≤ 2):
1. Find the table area for 2.00: F(2) = P(Z ≤ 2.00) = .5 + .4772 = .9772.
2. Find the table area for 1.00: F(1) = P(Z ≤ 1.00) = .5 + .3413 = .8413.
3. P(1 ≤ Z ≤ 2) = P(Z ≤ 2.00) − P(Z ≤ 1.00) = .9772 − .8413 = 0.1359.
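The table lookups above can be reproduced without the printed table, using the error function from Python's standard library; a sketch, not part of the original slides:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# The three worked examples, reproduced in code:
print(round(phi(1.56) - 0.5, 4))  # P(0 < Z < 1.56) -> 0.4406
print(round(phi(-2.47), 4))       # P(Z < -2.47)    -> 0.0068
print(round(phi(2) - phi(1), 4))  # P(1 < Z < 2)    -> 0.1359
```

The printed table gives areas between 0 and z, so `phi(z) - 0.5` matches a table entry, while differences of `phi` values give areas between two cut-off points.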
11-116 Finding Values of the Standard Normal Random Variable: P(0 < Z < z) = 0.40
To find z such that P(0 ≤ Z ≤ z) = .40:
1. Find a probability as close as possible to .40 in the table of standard normal probabilities: .3997, in the row for 1.2 and the column for .08.
2. Read the value of z from the corresponding row and column: z = 1.28, so P(0 ≤ Z ≤ 1.28) ≈ .40.
Also, since P(Z ≤ 0) = .50, it follows that P(Z ≤ 1.28) ≈ .90.

11-117 99% Interval around the Mean
To have .99 in the center of the distribution, there should be (1/2)(1 − .99) = (1/2)(.01) = .005 in each tail of the distribution, and (1/2)(.99) = .495 in each half of the .99 interval. That is:
P(0 ≤ Z ≤ z.005) = .495
Looking in the table of standard normal probabilities, the area .495 falls between the entries for 2.57 and 2.58, so z.005 = 2.575 and P(−2.575 ≤ Z ≤ 2.575) = .99.
[Figure: standard normal density with area .495 in each half of the central interval, .005 in each tail, and cut-off points at −2.575 and 2.575.]

11-118 4-5 The Transformation of Normal Random Variables
The area within kσ of the mean is the same for all normal random variables, so an area under any normal distribution is equivalent to an area under the standard normal. The transformation of X to Z:
Z = (X − µx) / σx
For example, if X ~ N(50, 10²), then P(40 ≤ X ≤ 60) = P(−1 ≤ Z ≤ 1), since (40 − 50)/10 = −1 and (60 − 50)/10 = 1. The transformation has two steps: (1) subtraction, (X − µx); (2) division by σx.
The inverse transformation of Z to X: X = µx + Zσx.

11-119 Using the Normal Transformation
Example 4-9: X ~ N(160, 30²)
P(100 ≤ X ≤ 180) = P((100 − 160)/30 ≤ Z ≤ (180 − 160)/30) = P(−2 ≤ Z ≤ .6667) = 0.4772 + 0.2475 = 0.7247
Example 4-10: X ~ N(127, 22²)
P(X < 150) = P(Z < (150 − 127)/22) = P(Z < 1.045) = 0.5 + 0.3520 = 0.8520

11-120 Using the Normal Transformation – Example 4-11
Example 4-11: X ~ N(383, 12²)
P(394 ≤ X ≤ 399) = P((394 − 383)/12 ≤ Z ≤ (399 − 383)/12) = P(0.9166 ≤ Z ≤ 1.333) = 0.4088 − 0.3203 = 0.0885
[Template solution and figure: normal density with µ = 383, σ = 12, and the corresponding standard normal area.]

11-121 The Transformation of Normal Random Variables
The transformation of X to Z: Z = (X − µx) / σx
The inverse transformation of Z to X: X = µx + Zσx
The transformation of X to Z, where a and b are numbers:
P(X < a) = P(Z < (a − µ)/σ)
P(X > b) = P(Z > (b − µ)/σ)
P(a < X < b) = P((a − µ)/σ < Z < (b − µ)/σ)

11-122 Normal Probabilities (Empirical Rule)
• The probability that a normal random variable will be within 1 standard deviation of its mean (on either side) is 0.6826, or approximately 0.68.
• The probability that a normal random variable will be within 2 standard deviations of its mean is 0.9544, or approximately 0.95.
• The probability that a normal random variable will be within 3 standard deviations of its mean is 0.9974.

11-123 4-6 The Inverse Transformation
The area within kσ of the mean is the same for all normal random variables. To find a probability associated with any interval of values for any normal random variable, all that is needed is to express the interval in terms of numbers of standard deviations from the mean. That is the purpose of the standard normal transformation. If X ~ N(50, 10²):
P(X > 70) = P((X − µ)/σ > (70 − 50)/10) = P(Z > 2)
That is, P(X > 70) can be found easily because 70 is 2 standard deviations above the mean of X: 70 = µ + 2σ. P(X > 70) is equivalent to P(Z > 2), an area under the standard normal distribution.

Example 4-12: X ~ N(124, 12²). Find x such that P(X > x) = 0.10.
From the table, P(Z > 1.28) ≈ 0.10, so
x = µ + zσ = 124 + (1.28)(12) = 139.36

11-124 Finding Values of a Normal Random Variable, Given a Probability
Example: X ~ N(2450, 400²). Find the symmetric interval around the mean that contains .95 of the probability.
Step 1: Draw pictures of the normal distribution in question and of the standard normal distribution.
Step 2: Shade the area corresponding to the desired probability: .9500 in the center, .4750 in each half.
Step 3 (11-126): From the table of the standard normal distribution, find the z value or values: the area .4750 corresponds to z = 1.96, so the cut-off points are −1.96 and 1.96.
Step 4 (11-127): Use the transformation from z to x to get the value(s) of the original random variable:
x = µ ± zσ = 2450 ± (1.96)(400) = 2450 ± 784 = (1666, 3234)

11-128 Normal Approximation to the Binomial
The normal distribution with µ = 3.5 and σ = 1.323 is a close approximation to the binomial with n = 7 and p = 0.50. With the continuity correction, the normal gives P(X < 4.5) = 0.7749, while the exact binomial gives P(x ≤ 4) = 0.7734.
[Minitab output: Normal(3.5, 1.323) cdf at 4.5 = 0.7751; Binomial(7, .5) cdf at 4 = 0.7734.]

11-129 For any research we are always interested in understanding the population parameter, so that decisions can be made based on information. Example: a marketer may be interested in the average consumption of sugar per household per month in the city of Delhi. This is the population parameter: all households of Delhi form the population, and the average consumption of sugar is the parameter, represented by µ.

11-130 However, finding this parameter is difficult, as it is virtually impractical to contact every household of Delhi (or the time taken would be very large), and the purpose of the study itself may be time-barred. Hence we must resort to collecting the information from only a subset of the population, called the sample. The sample value of the same variable is referred to as the statistic (x̄, "x-bar").

11-131 However, the sample mean is not equal to the population mean, and the difference between the two is the error in estimating the parameter (known as total error). This error occurs for several reasons.

11-132 Sample vs. Census
Conditions favoring the use of a sample versus a census:
1. Budget: Small (sample) / Large (census)
2. Time available: Short (sample) / Long (census)
3. Population size: Large (sample) / Small (census)
4. Variance in the characteristic: Small (sample) / Large (census)
5. Cost of sampling errors: Low (sample) / High (census)
6. Cost of nonsampling errors: High (sample) / Low (census)

11-133 Thus it is clear that sampling is required, and if the sample size is properly chosen, the error can also be kept at a minimum level.

11-134 Sampling Distribution
If the target segment (population) contains N elements, and from this population we randomly pick n elements, in how many possible ways can we pick these n elements?
• N^n ways if done with replacement
• NCn ways if done without replacement
For each of these samples there will be a sample mean. The way these sample means are spread is known as the sampling distribution.

11-135 Sampling Distribution
Let us illustrate the concept of a sampling distribution. Consider a population consisting of only three members (A, B, and C). Asked how many chocolates they eat in a day, the answers are: A = 1 per day, B = 2 per day, and C = 3 per day. Hence the variable is the number of chocolates, {1, 2, 3}. This gives a population average µ = 2 and a variance σ² = 2/3. If samples of size 2 are taken with replacement, let us list all possible samples along with their sample means.

11-136 Sampling Distribution
Possible samples and their means:
(1,1) → 1; (1,2) → 1.5; (1,3) → 2; (2,1) → 1.5; (2,2) → 2; (2,3) → 2.5; (3,1) → 2; (3,2) → 2.5; (3,3) → 3

Sample mean | Frequency | Probability
1 | 1 | 1/9
1.5 | 2 | 2/9
2 | 3 | 3/9
2.5 | 2 | 2/9
3 | 1 | 1/9

Expected value of the sample mean = 2 = population mean.
Variance of the sample means = σ²/n = (2/3)/2 = 1/3.

11-137 Sampling Distribution
[Figure: probability histogram of the sample means at 1, 1.5, 2, 2.5, 3.]
Does this appear to be normally distributed? Yes indeed!
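The chocolate example can be enumerated directly; a sketch using only the standard library:

```python
from itertools import product
from statistics import mean

population = [1, 2, 3]  # chocolates per day for A, B, C

# All 9 possible samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in samples]

mu = mean(population)                                 # population mean = 2
pop_var = sum((x - mu) ** 2 for x in population) / 3  # population variance = 2/3

exp_sample_mean = mean(sample_means)
var_sample_mean = sum((m - exp_sample_mean) ** 2 for m in sample_means) / 9

print(exp_sample_mean)  # equals the population mean, 2
print(var_sample_mean)  # equals pop_var / n = (2/3)/2 = 1/3
```

Enumerating the full sampling distribution like this is only possible for tiny populations, but it makes the two identities E(x̄) = µ and V(x̄) = σ²/n concrete.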
11-138 Sampling Distribution
The Central Limit Theorem says that the distribution of the sample mean is approximately normal as long as the sample size is large, such that:
• the expected value of the sample mean = population mean, and
• the standard deviation of the sample mean = population standard deviation / √n.
This is true irrespective of the distribution of the population.

11-139 Properties of the Sampling Distribution of the Sample Mean
Comparing the population distribution and the sampling distribution of the mean: the sampling distribution is more bell-shaped and symmetric; both have the same center; and the sampling distribution of the mean is more compact, with a smaller variance.
[Figure: uniform population distribution on (1, 8) versus the sampling distribution of the mean, which concentrates around the center.]

11-140 Relationships between Population Parameters and the Sampling Distribution of the Sample Mean
The expected value of the sample mean is equal to the population mean:
E(X̄) = µ_X̄ = µ_X
The variance of the sample mean is equal to the population variance divided by the sample size:
V(X̄) = σ²_X̄ = σ²_X / n
The standard deviation of the sample mean, known as the standard error of the mean, is equal to the population standard deviation divided by the square root of the sample size:
SD(X̄) = σ_X̄ = σ_X / √n

11-141 Sampling from a Normal Population
When sampling from a normal population with mean µ and standard deviation σ, the sample mean X̄ has a normal sampling distribution:
X̄ ~ N(µ, σ²/n)
This means that, as the sample size increases, the sampling distribution of the sample mean remains centered on the population mean, but becomes more compactly distributed around that population mean.
[Figure: sampling distributions for n = 2, n = 4, and n = 16 superimposed on the normal population.]

11-142 The Central Limit Theorem
When sampling from a population with mean µ and finite standard deviation σ, the sampling distribution of the sample mean will tend to a normal distribution with mean µ and standard deviation σ/√n as the sample size becomes large (n > 30). For "large enough" n: X̄ ~ N(µ, σ²/n).
[Figure: sampling distributions for n = 5, n = 20, and large n, approaching the normal shape.]

11-143 The Central Limit Theorem Applies to Sampling Distributions from Any Population
[Figure: normal, uniform, and skewed populations, with the sampling distribution of the mean for n = 2 and n = 30 in each case approaching normality.]

11-144 Sampling Distribution – Example
Let us assume that we are interested in understanding the average consumption of sugar per household per month in a given target population. That is, we are interested in the information µ = average sugar consumed per month. We can only estimate it from sample information, i.e., from the sample mean, as follows.

11-145 Sampling Distribution – Example
For the example, let us assume that we randomly sampled 100 households and found that the sample mean was 1890 grams per household per month. Let us also assume that the population standard deviation was known to be 230 grams. We use the fact that the sample mean obtained is one among the many different sample means possible, and that the sample means are normally distributed. Hence an interval estimate can be obtained as:
µ = x̄ ± Z σ/√n, where Z is the standard normal deviate.

11-146 Sampling Distribution – Example
Substituting the values: µ = 1890 ± Z × (230/√100). For 90% confidence, Z = 1.645 (refer to the Z table):
= 1890 ± 1.645 × (230/√100) = 1890 ± 37.84
There is a 90% chance that the actual µ will be contained within 1852.16 to 1927.84 grams.
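The interval computation above can be sketched in Python. Note that the conventional two-sided 90% value is z ≈ 1.645 (z = 1.28 corresponds to a central area of about .80, not .90):

```python
from math import sqrt

x_bar, sigma, n = 1890, 230, 100
z = 1.645                    # two-sided 90% confidence

se = sigma / sqrt(n)         # standard error = 230/10 = 23 grams
margin = z * se              # ~37.8 grams
print(round(x_bar - margin, 1), round(x_bar + margin, 1))  # ~1852.2 ~1927.8
```

For 95% confidence, z = 1.96 widens the interval accordingly; the trade-off between confidence level and interval width is entirely in this one factor.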
11-147 From the example just explained, you can see that
(sample mean x̄ − µ) = Z σ/√n = error in estimating µ
σ/√n is a function of n and is often referred to as the "standard error." Hence, if the acceptable error is known, the sample size can be determined (this is based on sampling error alone).

11-148 Sampling without replacement
When we sample without replacement from a finite population, the standard deviation of the sample means (also known as the standard error) incorporates a finite population multiplier:
σ_x̄ = (σ/√n) × √((N − n)/(N − 1))
where N = population size and n = sample size. The finite population multiplier is always ≤ 1. It can be noted that as N goes to ∞, the multiplier becomes 1, and hence the standard error is the same as if the sampling were done with replacement.

11-149 Sampling distribution for a proportion
Consider an example: the number of times a hotel is unable to accommodate its customers because the hotel is full. This can only be expressed in terms of a proportion, e.g., 10%. In this case too, if the sample size is large, the sampling distribution of the proportion behaves like a normal distribution, with the expected value of the sample proportion equal to the population proportion, and the standard error equal to √(pq/n), where q = 100 − p if p is in percentage.

11-150 Sampling distribution for a proportion
Similarly, an interval estimate for a proportion can be found. For example, if a sample of 1000 voters is selected and 400 of them decide to vote for political party X, then the proportion of the population expected to vote for party X would be:
P = 40% ± Z √((40 × 60)/1000) = 40% ± 3.04 (Z = 1.96 for a 95% confidence level)
Hence the interval would be 36.96% to 43.04%.

11-151 Sampling distribution of the difference of two means
Consider a group of male employees and a group of female employees in the IT industry at a given level.
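The voter-proportion interval from slide 11-150 can be checked numerically; a minimal sketch:

```python
from math import sqrt

p, n = 40.0, 1000      # sample proportion, in percent, and sample size
q = 100 - p            # q = 100 - p when p is expressed in percent
z = 1.96               # 95% confidence

se = sqrt(p * q / n)   # standard error ~ 1.55 percentage points
print(round(p - z * se, 2), round(p + z * se, 2))  # 36.96 43.04
```

The same arithmetic works with p as a fraction (0.40) as long as q and the interval are kept on the same scale.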
It is desired to understand the level of difference in their salaries (given that there may be discrimination).

11-152 Sampling distribution of the difference of two means
In this situation there could be many possible samples of size n1 drawn from the males and, similarly, many samples of size n2 drawn from the female employees. For each pair of samples, one from each group, the sample means can be subtracted, giving the observed level of difference in salary. This difference of two sample means also behaves like a normal distribution for large samples.

11-153 Sampling distribution of the difference of two means
Thus the sampling distribution of the difference of two means is normally distributed, with expected value E(x̄_male − x̄_female) equal to the difference of the population means, µ_male − µ_female (zero if there is no real difference), and with variance
s1²/n1 + s2²/n2
whose square root is the standard error of the difference.

11-154 Sampling distribution for small samples
In our earlier discussion we have always emphasized the need for a large sample for the sample mean to be distributed as a normal distribution. What is meant by a LARGE sample?

11-155 Sampling distribution for small samples
Gosset was working with samples that were considered small, such as 10, 15, or 25, and he found that the distribution of the sample mean was not exactly a normal distribution but was nevertheless symmetric, with a larger variance. He named his distribution the Student's t distribution. Thus the mean value of t = 0, and the probability density function is a function not only of the mean and variance but also of what he called the "degrees of freedom."

11-156 Confidence Interval for a Mean (µ) with σ Unknown: Degrees of Freedom
• Degrees of freedom (d.f.) is a parameter based on the sample size that is used to determine the value of the t statistic.
• Degrees of freedom tell how many observations are used to calculate s, less the number of intermediate estimates used in the calculation: d.f. = n − 1.
McGraw-Hill/Irwin © 2007 The McGraw-Hill Companies, Inc. All rights reserved.
11-157 Confidence Interval for a Mean (µ) with σ Unknown
• As n increases, the t distribution approaches the shape of the normal distribution.
• For a given confidence level, t is always larger than z, so a confidence interval based on t is always wider than if z were used.

11-158 Degrees of freedom
To understand degrees of freedom, consider the numbers 1, 2, 3, with total 6. We have the freedom to change any two of these three numbers without changing the total; thus the degrees of freedom would be 2. In general, the degrees of freedom are n − 1, where n = sample size.

11-159 Sampling distribution for small samples
Thus, for a small sample, and whenever the population variance is unknown, the distribution of the sample mean behaves like a t distribution. This t distribution becomes very close to the normal distribution when the degrees of freedom are 29 and above. Hence the definition of a large sample in statistics is a sample of size 30 or more; for samples smaller than 30, the distribution needed is the t.

11-160 Confidence Interval for a Mean (µ) with σ Unknown: Student's t Distribution
• t distributions are symmetric and shaped like the standard normal distribution.
• The t distribution is dependent on the size of the sample.

11-161 Sampling distribution for small samples
Thus, for all calculations with small samples, the Z value will be substituted with t values. Usually, small samples are not used when proportions are involved.

11-162 Confidence Interval for a Mean (µ) with σ Unknown: Student's t Distribution
Use the Student's t distribution instead of the normal distribution when the population is normal but the standard deviation is unknown and the sample size is small.
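A t-based interval can be sketched with a hypothetical sample. The data and the table value t = 2.262 (95% confidence, 9 degrees of freedom) are illustrative, not from the slides:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of n = 10 observations, so d.f. = 9.
sample = [48, 52, 51, 47, 53, 50, 49, 52, 46, 52]
n = len(sample)
x_bar = mean(sample)           # 50
s = stdev(sample)              # sample std dev, computed with n - 1

t = 2.262                      # t-table value for d.f. = 9, 95% confidence
half = t * s / sqrt(n)
print(round(x_bar - half, 2), round(x_bar + half, 2))  # 48.28 51.72
```

With z = 1.96 in place of t = 2.262, the interval would be narrower, which illustrates the slide's point that t-based intervals are always wider for the same confidence level.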
The confidence interval for µ (σ unknown) is:
x̄ − t s/√n < µ < x̄ + t s/√n

11-163 Confidence Interval for a Mean (µ) with σ Unknown: Student's t Distribution
[Figure: Student's t distribution compared with the standard normal.]

11-164 Confidence Interval for a Mean (µ) with σ Unknown: Comparison of z and t
• For very small samples, t-values differ substantially from the normal.
• As degrees of freedom increase, the t-values approach the normal z-values.
• For example, for n = 31 the degrees of freedom are: 31 − 1 = 30.
• What would the t-value be for a 90% confidence interval?

11-165 Comparison of z and t
For d.f. = 30, the t-value for a 90% confidence interval is 1.697; the corresponding z-value is 1.645.

11-166 Confidence Interval for the Difference of Two Means, small sample (µ1 − µ2)
The procedure for constructing a confidence interval for µ1 − µ2 depends on our assumption about the unknown variances. Assuming equal variances:
(x̄1 − x̄2) ± t × √[((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)] × √(1/n1 + 1/n2)
with (n1 − 1) + (n2 − 1) = n1 + n2 − 2 degrees of freedom.

11-167 The first square-root factor, √[((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)], is the pooled standard deviation.

11-168 The whole factor multiplying t is the standard error for the difference of means.
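The pooled-variance interval can be sketched with hypothetical summary statistics; all numbers below, including the table value t = 2.086 for 20 degrees of freedom at 95% confidence, are illustrative:

```python
from math import sqrt

# Hypothetical two-sample summaries (not from the slides).
n1, x1, s1 = 12, 85.0, 8.0
n2, x2, s2 = 10, 78.0, 7.0

df = n1 + n2 - 2                 # 20 degrees of freedom
t = 2.086                        # t-table value for d.f. = 20, 95% confidence

# Pooled variance weights each sample variance by its degrees of freedom.
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
se = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)   # standard error of the difference

diff = x1 - x2
print(round(diff - t * se, 2), round(diff + t * se, 2))  # 0.24 13.76
```

Since the interval here excludes zero, these (hypothetical) samples would suggest a real difference between the two group means at the 95% level.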
11-169 F-Distribution
The F distribution (Fisher's) is the ratio of two variances. If two samples are drawn and we wish to know whether the samples are drawn from a single population or from two separate populations, an F-statistic is calculated. This F-statistic is the ratio of the sample variances of the two samples: F = S1² / S2².

11-170 F-distribution curve
[Figure: probability density curve of the F-statistic, skewed to the right.]

11-171 F-distribution
We will have more occasions to talk about this F-statistic later, while discussing hypothesis testing.

11-173 Chi-Square Distribution
The chi-square (χ²) distribution is used when we wish to estimate the population variance from a known sample variance. Similarly, there are many nonparametric tests where we would use chi-square tests. The shape of the chi-square distribution varies with the degrees of freedom.

11-174 Chi-square distribution
[Figure: probability density curve of the chi-square statistic.]

11-175 Chi-square distribution
The square of a Z (standard normal) variable behaves like a chi-square distribution. Similarly, a sum of the squares of several standard normal variables also behaves like a chi-square distribution. We will have more to say about this distribution when we look at hypothesis testing.

11-177 Visual Displays and Correlation Analysis
Visual Displays
• Begin the analysis of bivariate data (i.e., two variables) with a scatter plot.
• A scatter plot displays each observed data pair (xi, yi) as a dot on an x-y grid and indicates visually the strength of the relationship between the two variables.

11-178 Visual Displays
[Figure: scatter plot of maintenance cost per month against hours driven per week.]
11-179 to 11-181 Visual Displays and Correlation Analysis
[Figures: scatter plots illustrating weak positive and strong positive correlation (11-179); strong negative and weak negative correlation (11-180); and a nonlinear relation and no correlation (11-181).]

11-182 Correlation Analysis
• The sample correlation coefficient (r) measures the degree of linearity in the relationship between X and Y: −1 ≤ r ≤ +1. Values near −1 indicate a strong negative relationship, values near +1 a strong positive relationship, and r = 0 indicates no linear relationship.

11-183 Correlation Analysis
r = Cov(x, y) / (s(x) s(y))

11-184 The correlation coefficient can also be found as follows:
r = (n∑xy − ∑x∑y) / √{(n∑x² − (∑x)²) × (n∑y² − (∑y)²)}

11-186 to 11-189 Use of Excel for finding the correlation
[Excel screenshots.]

11-190 For the maintenance-cost data, this formula gives r = 0.9322; hence r² = 0.869.

11-191 Properties of the correlation coefficient
1. The value of r always lies between −1 and +1.
2. A change of origin and scale does not affect the value of the coefficient.

11-192 Change of origin and scale
[Slide illustrating what a change of origin and scale means.]

11-193 Properties of the correlation coefficient
1. The value of r always lies between −1 and +1.
The change of origin and scale does not affect the value of the coefficient.
3. If x and y are interchanged, the coefficient is unaffected, i.e., it remains unaltered. (We usually refer to x as the independent variable and y as the dependent variable.)
4. The fourth property can be explained only after we cover regression (hence it is held until then).
11-194 Bivariate Regression
• Bivariate regression analyzes the relationship between two variables.
• It specifies one dependent (response) variable and one independent (predictor) variable.
• The hypothesized relationship may be linear, quadratic, or of some other form.
11-195 [figure: scatter plot of maintenance cost vs. hours of vehicle driven, with fitted line]
11-196 In the equation y = a + bx, how do we find the values of a and b, the intercept and the slope?
11-198 How to Develop a Regression Line
11-200 Normal Equations
∑Y = na + b∑X
∑XY = a∑X + b∑X²
Which, when solved, give:
b = [n∑XY − (∑X)(∑Y)] / [n∑X² − (∑X)²]
a = Ȳ − b X̄
11-201 For the problem considered earlier, b = 777.32 and a = −6115.9 (y is the dependent variable, x the independent variable).
11-202 If we had wanted the normal equations for the situation where x is the dependent variable and y the independent variable, the normal equations would simply swap x and y. You will find that the numerator of the slope remains unchanged, but the denominator becomes n∑Y² − (∑Y)².
11-203 Hence the new coefficients are a′ = 9.393 and b′ = 0.001118 when x is the dependent variable and y the independent variable.
11-204 Properties of the correlation coefficient:
1. The value of r always lies between −1 and +1.
2. A change of origin and scale does not affect the value of the coefficient.
3. If x and y are interchanged, the coefficient remains unaltered.
4. b × b′ = r², which here is 777.32 × 0.001118 = 0.869.
11-206 From this it follows:
1. Both regression coefficients must have the same sign (either + or −).
2. If one regression coefficient is greater than 1, the other regression coefficient must be less than 1.
3. If one regression coefficient is less than 1, the other may be greater or less than 1.
11-207 Regression Terminology: Fitting a Regression on a Scatter Plot in Excel
• Step 1:
- Highlight the data columns.
- Click on the Chart Wizard and choose Scatter Plot.
- In the completed graph, click once on the points in the scatter plot to select the data.
- Right-click and choose Add Trendline.
- Choose Options and check Display Equation.
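The normal equations above, and property 4 of the correlation coefficient (b × b′ = r²), can be checked numerically. This is a minimal sketch with made-up paired data (the x and y values are hypothetical, not the course example):

```python
from math import sqrt

# Hypothetical paired data (x = hours driven, y = maintenance cost)
x = [5, 10, 15, 20, 25]
y = [2000, 4500, 6000, 9500, 11000]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)
syy = sum(yi * yi for yi in y)

# Slope and intercept of y on x, from the normal equations
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

# Slope of x on y: same numerator, denominator n*syy - (sy)^2
b_prime = (n * sxy - sx * sy) / (n * syy - sy ** 2)

# Correlation coefficient via the computational formula
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(b, a)                 # 460.0 -300.0
print(b * b_prime, r ** 2)  # the two values agree (property 4)
```

Running this shows b × b′ equal to r², illustrating property 4 on arbitrary data.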
11-208 to 11-210 Fitting a Regression on a Scatter Plot in Excel [screenshots]
11-211 Regression Terminology
11-213 What Is a Hypothesis?
A hypothesis is a conjectural statement about a certain characteristic of the whole population or target segment.
Ex: What is the average expenditure per month incurred on vehicle maintenance? If it is suggested that this average is Rs. 1200 per month, then this is a hypothesis. What this implies is that if we took all the people who drive their vehicles, found everyone's expenditure, and averaged the results, it would be Rs. 1200 per month.
11-214 Ex: A car dealer claims that, on average, a car of a given model gives at least 15 km to a litre of petrol. This implies that for the given model, if we took all the vehicles and found the average mileage per litre of petrol, it would be at least 15 km.
11-215 Ex: An exporter claims that the proportion of defects in his consignments will be at most 2%. This means that if we took all his consignments and found the proportion of defects, the average defect rate would not exceed 2%.
11-216
Ex: A refill manufacturer for ball-point pens claims that the length of a refill is, on average, 140 mm. This implies that while each individual refill may not be exactly 140 mm, on average the length of the refills is 140 mm.
11-217 What do all these hypotheses show?
1. They are all conjectures about a population parameter.
2. They always speak about the population parameter.
3. They are statements made only on the basis of the research question at hand, not on the basis of the data collected.
11-218 Why do we need hypothesis testing?
While we want to verify the statement specified in the hypothesis, it would be impossible to do so without conducting a census; if a census were carried out, hypothesis testing would not be essential. However, a census is usually impractical and need not even be accurate. Hence we must comment on the hypothesis on the basis of sample information alone. Drawing an inference about the population parameter from sample information is called hypothesis testing.
11-219 Characteristics of a good hypothesis:
1. It should be based on sound previous research.
2. It should look for realistic explanations.
3. It should state the variables clearly.
4. It should be easily amenable to testing.
5. It should measure the variables on the correct scale.
11-220 Basics of hypothesis testing
As said before, the hypothesis is about a characteristic of the population, usually called the parameter. The parameter can be obtained only by conducting a census, which is not possible or practical, so our inference about the parameter is based on sample information. Based on a sample, we may reject a true hypothesis, or conversely accept a hypothesis that is actually not true. Both of these are errors in making the inference.
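The kind of sample-based inference described above can be sketched as a small function. This is a minimal illustration of a one-sample Z test for a mean, with purely hypothetical numbers (sample mean 36, hypothesized mean 30, known σ = 7, n = 49, right-tail test at 5% significance):

```python
from math import sqrt

def one_sample_z(xbar, mu0, sigma, n):
    """Z statistic for a single mean with known population sigma."""
    standard_error = sigma / sqrt(n)
    return (xbar - mu0) / standard_error

# Hypothetical numbers for illustration
z = one_sample_z(xbar=36, mu0=30, sigma=7, n=49)
critical = 1.645  # Z table value, 5% alpha, single (right) tail
reject_h0 = z > critical
print(z, reject_h0)  # 6.0 True
```

If the computed Z exceeds the critical value, the sample evidence is strong enough to reject the null hypothesis at the chosen alpha level.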
11-221 Basics of hypothesis testing
Consider the problem: a car dealer claims that, on average, a car of a given model gives at least 15 km to a litre of petrol. If the average is more than 15 km/litre we are satisfied, but if it is less than what the dealer claims, we have difficulty believing the claim. This is what we wish to verify or infer, and the inference must be based on the information available from a single sample of size n, where n might be 30 cars, 40 cars, or even about 20 cars.
11-222 Hence we can write the hypotheses as follows. Null hypothesis: generally we do not disagree with the dealer to begin with, unless there is sufficient evidence to disagree. Hence we write:
Null hypothesis (Ho): µ ≥ 15
Alternative hypothesis (Ha): µ < 15
11-224 Approaches for hypothesis testing
11-225 Errors in hypothesis testing:
- Hypothesis true, accepted: no error.
- Hypothesis true, rejected: Type I error (alpha error).
- Hypothesis false, accepted: Type II error (beta error).
- Hypothesis false, rejected: no error.
11-226 Steps in Hypothesis Testing
1. Based on the research question, develop the null and alternative hypotheses.
2. Decide on the level of Type I error (alpha).
3. Decide whether the test is single-tail or two-tail.
4. Decide on the appropriate test statistic to be used (Z, t, or another).
5. Calculate the test statistic.
6. Read the critical value for the chosen level of Type I error from the Z or t table.
7.
Compare the calculated test statistic with the table value.
8. Make a conclusion.
11-229 Worked Example 1 (single mean):
A manufacturing firm has been averaging shipment of a product within 30 days of receiving an order. Of late it is believed that the average shipping time has increased. To test this, a sample of size 49 is drawn randomly from the shipments made during a given period. The sample shows an average shipping time of 36 days. The population standard deviation is believed to be 7 days. Is there sufficient evidence to believe that shipping is getting delayed?
Step 1: Formulate the null and alternative hypotheses:
Ho: µ ≤ 30
Ha: µ > 30
11-230 Step 2: Decide on the level of significance (Type I error). This is given in the problem as 5%.
Step 3: Looking at the hypotheses, it is clear that this is a single-tail test, and the problem area is to the right; hence a right-tail test.
11-231 Worked Example 1 contd:
A 5% level of significance test is thought to be good.
Step 4: Decide on the test statistic. Since the sample size is greater than 30 and σ is known, we can use the Z statistic.
Step 5: Calculate the Z statistic = (x̄ − µ) / S.E. (standard error) = (36 − 30) / (7 / √49) = 6.
11-232 Step 6: The Z table value at the 5% alpha level, single tail, is 1.645.
Step 7: Since Z(cal) > Z(table value), we reject (are unable to accept) the null hypothesis.
11-233 Step 8: Conclusion: There appears to be a delay in shipping times recently.
11-234 Worked Example 2 (single mean):
Let us consider the car example. The dealer claims that a particular model gives at least 15 km/litre of fuel.
A random sample of 36 cars gives a mean of 14.6 km/litre, and the population standard deviation is assumed known as 0.75 km/litre. Assume a 5% level of significance.
Step 1: Ho: µ ≥ 15; Ha: µ < 15.
Step 2: The significance level is given as 5%.
Step 3: This is also a single-tail test, but the direction is to the left.
Step 4: Since the sample size is large (36), we can use the Z test.
11-235 Step 5: Calculate the Z statistic: Z = (14.6 − 15) / (0.75 / √36) = −0.4 / (0.75/6) = −3.2.
Step 6: Read the value of Z at the 5% level of significance. (Remember this is now on the left side, hence Z should be negative: Z = −1.645.)
11-236 Step 7: Compare Z calculated with the Z table value. For a left-tail test the rule is: if Z calculated ≤ Z table value, reject Ho; else accept Ho.
11-237 Worked Example 2 contd:
Assume a 5% level of significance.
Step 7: Here Z(cal) = −3.2 < Z(table value) = −1.645; hence reject Ho.
Step 8: Conclusion: the dealer's claim cannot be accepted.
11-238 Worked Example 3 (single mean):
A refill manufacturer claims that the refill length for a ball-point pen is 140 mm. A sample of size 100 is selected, and the mean length of the refills is found to be 141.77 mm with a standard deviation of 5.88 mm. At a 5% level of significance, can it be concluded that the refills are of poor quality?
Step 1: Null and alternative hypotheses:
Ho: µ = 140 mm
Ha: µ ≠ 140 mm
Step 2: The alpha level is given as 5%.
Step 3: In this case it is a two-tail test, as deviation to either side is unacceptable (the refill will not fit in the pen, hence poor quality).
11-239 Step 4: Since the sample size is 100, which is large, a Z test can be used.
Step 5:
Calculate the Z statistic: Z = (141.77 − 140) / (5.88 / √100) = 1.77 / 0.588 = 3.01.
Step 6: Read the Z statistic from the table at 5%, two-tail: 1.96.
11-240 Step 7: Z(cal) > Z(table value), hence reject the null hypothesis.
Step 8: Conclusion: the refills produced are of poor quality.
11-241 Rule for rejecting a null hypothesis:
- For a right-tail test (single tail): if Z(cal) ≥ Z(table value), reject Ho.
- For a left-tail test (single tail): if Z(cal) ≤ Z(table value), reject Ho.
- For a two-tail test: if |Z(cal)| ≥ |Z(table value)|, reject Ho.
It can be observed that even for a single-tail test, if we consider the modulus of Z, the same rule as for the two-tail test can be used for rejecting Ho (use caution here).
11-242 When the population σ is unknown:
While conducting hypothesis testing, the population σ is usually unknown. In these situations the sample standard deviation (s) is used instead of σ, and the standard error becomes s/√n. The sample standard deviation must be calculated with (n − 1) in the denominator, as stated in the earlier subject DRM 01, since only then is it an unbiased estimator of σ. Further, the sample size should be large (large was defined as n > 30).
11-243 When the sample size is small (n < 30):
When the sample size is less than 30, it is considered a small sample, and the t distribution should be used instead of Z. Further, if the population standard deviation is unknown, it is also recommended that the t distribution be used. The only change is that the Z statistic (x̄ − µ)/S.E. is replaced with t.
11-244 Worked example (single proportion):
Insurance companies have recently had difficulty settling medical claims directly with hospitals. One reason can be attributed to false billing by individuals who have taken medical insurance.
A company believes that recently there has been an increase in the number of false medical claims, which has gone up to 5%. A random sample of 100 customers indicated that 7 customers had falsified their claims. Is there any reason to believe that false medical claims have gone up? Use a 5% level of significance.
Step 1: Ho: p ≤ 5%; Ha: p > 5%.
Step 2: The level of significance is given as 5%.
Step 3: This is a single-tail (right-tail) test.
11-245 Step 4:
The Z statistic will be used, as the sample size is large.
Step 5: Z = (p̂ − p) / standard error = (p̂ − p) / √(pq/n) = (7% − 5%) / √(5% × 95% / 100) = 2 / 2.179 = 0.92.
Step 6: The Z table value at 5%, single tail, is 1.645.
Step 7: Compare Z(cal) with the Z table value: Z(cal) < Z(table value), hence accept Ho.
11-247 Step 8: We cannot conclude that the number of false claims has gone beyond 5%, even though the sample shows 7%.
11-248 Hypothesis test for the difference of two means
Let us consider the following situations:
Case 1: Does going to VLCC help in reducing weight?
Case 2: Does the company always assess the rent for a residential quarter lower than the employee himself does?
Case 3: Is there gender discrimination among employers in a given industry for the same level of job?
Case 4: Is a new drug more effective in treating a disease than the existing drug?
11-249 In all cases we are talking about the difference of two means.
Case 1: The mean before joining VLCC and the mean after joining VLCC: µ(before) − µ(after).
Case 2: The mean value of the residential quarter assessed by the company and the mean value assessed by the employee for whom it is meant:
µ(company) − µ(employee).
Case 3: The mean wage given to women employees and the mean wage given to men employees: µ(women) − µ(men).
Case 4: The mean time to recover with the new drug and the mean time to recover with the existing drug: µ(new drug) − µ(old drug).
11-250 Despite each of these cases being a difference of two means, there is one essential difference. In Cases 1 and 2 we are talking about the same sample:
Case 1: the same sample's weight before joining VLCC and after joining VLCC.
Case 2: the same house assessed by the company and by the employee.
11-251 In Cases 3 and 4:
Case 3: mean wages for a group of women and mean wages for a group of men; each sample is independently drawn.
Case 4: mean recovery time using the new drug and mean recovery time using the existing drug; each sample is independently drawn (obviously the same patient cannot take both drugs).
11-252 Difference of two means: either dependent samples or samples independently drawn. The two situations are treated differently.
11-253 Difference of two means, dependent samples
Ex: It was intended to find out whether there is any difference in the productivity of a worker immediately after a weekly off versus immediately before a weekly off. With the weekly off on Sunday, it was desired to find whether productivity differs between Saturday and Monday. Hence productivity was measured on Saturdays and Mondays for the same set of workers. The data are as follows (use a 5% level of significance):
Worker ID: 1 2 3 4 5 6 7 8 9
Saturday: 25 32 20 26 29 21 18 17 27
Monday: 28 29 29 36 35 30 32 24 25
11-254 Dependent sample case
Step 1: Develop the null and alternative hypotheses:
Ho: µ(sat) = µ(mon)
Ha: µ(sat) ≠ µ(mon)
The problem does not suggest that productivity is higher on Saturdays than on Mondays or vice versa, so it could go either way.
Hence the ≠ symbol in the alternative hypothesis. This also implies:
Ho: µ(sat) − µ(mon) = 0, i.e., difference = 0
Ha: µ(sat) − µ(mon) ≠ 0, i.e., difference ≠ 0
11-255 Differences (Saturday − Monday) for the nine workers: −3, 3, −9, −10, −6, −9, −14, −7, 2.
Average difference = −5.88; sample standard deviation s = 5.622.
11-256 Step 2: The alpha level is given as 5%.
Step 3: This is a two-tail test.
Step 4: The test statistic will be t, as the sample size is small and σ is unknown.
t(cal) = (d̄ − 0) / (s/√n) = (−5.88 − 0) / (5.622/√9) = −5.88 / 1.874 = −3.13, or |t| = 3.13.
Step 5: Find the table value of t for 5%, two-tail, with 8 degrees of freedom (n − 1): 2.306.
11-257 Step 6: Compare t(cal) with the table value: if |t(cal)| > |t(table value)|, reject Ho. Here 3.13 > 2.306.
Step 7: Reject Ho.
Step 8: Conclusion: There is a change in productivity between just before and just after the weekend.
11-258 Hence it is clear that the difference of two means, dependent-sample case, is treated as if it were a single-mean case. Further, the data are always obtained in pairs, and the sample sizes are usually less than 30, hence a t test is normally used. This test is also called the paired t test.
11-259 Difference of Two Means - Independent Samples
Consider Cases 3 and 4 discussed earlier, reproduced below for ready reference.
Case 3:
Mean wages for a group of women vs. mean wages for a group of men; each sample is independently drawn.
Case 4: Mean recovery time using the new drug vs. mean recovery time using the existing drug; each sample is independently drawn (obviously the same patient cannot take both drugs).
11-260 For Case 3 the hypotheses are:
Ho: µ(women) = µ(men)
Ha: µ(women) ≠ µ(men)
Depending on the problem it could instead have been a single-tail test; one tail or two tails depends on the research question being addressed.
What is the implication of accepting the null hypothesis? It means that both groups belong to the same population, i.e., there is only one population and hence one mean and one variance.
11-261 What if the null hypothesis is rejected (not accepted)? This would imply that all men belong to one population and all women to a different population; since there are two populations, there are two different means, and the variances may be either equal or unequal.
11-262 The above hypotheses also imply:
Ho: µ(women) − µ(men) = 0
Ha: µ(women) − µ(men) ≠ 0
Writing the above hypotheses is Step 1. Now we collect samples from the two groups and find their wages. The average wages for women and for men are calculated separately, along with the sample variances.
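The test statistic for this independent-sample case can be sketched as follows. This is a minimal illustration with made-up summary numbers (the means, variances, and sample sizes are hypothetical), using the standard error √(s₁²/n₁ + s₂²/n₂):

```python
from math import sqrt

def two_mean_z(xbar1, xbar2, var1, var2, n1, n2):
    """Z statistic for the difference of two means, independent samples,
    under Ho: mu1 - mu2 = 0, with standard error sqrt(var1/n1 + var2/n2)."""
    se = sqrt(var1 / n1 + var2 / n2)
    return (xbar1 - xbar2) / se

# Hypothetical summary data for two independent groups
z = two_mean_z(xbar1=6.23, xbar2=5.36, var1=1.87, var2=2.30, n1=40, n2=30)
critical = 1.96  # two-tail Z table value at 5% alpha
reject_h0 = abs(z) >= critical
print(round(z, 2), reject_h0)  # 2.48 True
```

Because the samples are independently drawn, the variances of the two group means simply add inside the standard error.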
11-263 Step 2: Decide on the level of significance.
Step 3: Decide whether it is a single-tail or two-tail test.
Step 4: Decide on the test statistic: Z for a large sample, t for a small sample.
Step 5: Calculate the Z statistic = {[x̄(women) − x̄(men)] − [µ(women) − µ(men)]} / S.E.
Recall that in the course on sampling methods we gave the standard error for the difference of two means, independent-sample case.
11-264 Step 6: Find the Z value for the alpha level of significance from the table.
Step 7: Compare Z(cal) with Z(alpha): if Z(cal) ≥ Z(alpha), reject Ho.
Step 8: Conclude your result.
11-265 Example: A firm is interested in finding out whether there is any difference in the stress levels of employees working in the HR department and in the Marketing department.
A random sample of 30 HR employees had their stress level measured at 5.36 on a scale of 10, while a random sample of 40 marketing personnel showed a stress level of 6.23 on a scale of 10. At a 5% level of significance, can we conclude that the stress levels are different for the two groups? The variance in stress levels for HR was 2.3 and that for marketing was 1.87.
Step 1: Ho: µ(mktg) = µ(HR), i.e., µ(mktg) − µ(HR) = 0
Ha: µ(mktg) ≠ µ(HR), i.e., µ(mktg) − µ(HR) ≠ 0
11-266 Step 2: The alpha level is specified as 5%.
Step 3: This is a two-tail test.
11-267
Step 4: Both samples are large, so the Z statistic is used.
Step 5: Calculate Z = (6.23 - 5.36) / standard error.
Recall that the standard error for the difference of two means in the independent-samples case is
√(σ1²/n1 + σ2²/n2)
If the population variance σ² is unknown, use the unbiased estimator s², the sample variance.
Hence Z = 0.87 / √{(1.87/40) + (2.3/30)} = 0.87 / 0.35 = 2.486
Step 6: The Z table value (two-tail) at 5% alpha = 1.96.
Step 7: Z(cal) > Z(table value), so reject Ho.
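The calculation in Steps 5-7 can be sketched in Python (a minimal illustration using only the standard library; note that computing the standard error without the slides' intermediate rounding to 0.35 gives Z ≈ 2.48 rather than 2.486):

```python
from math import sqrt

# Sample statistics taken directly from the stress-level example.
n_mktg, mean_mktg, var_mktg = 40, 6.23, 1.87
n_hr, mean_hr, var_hr = 30, 5.36, 2.30

# Standard error for the difference of two independent means,
# using the sample variances as estimates of the population variances.
se = sqrt(var_mktg / n_mktg + var_hr / n_hr)
z = (mean_mktg - mean_hr) / se  # ≈ 2.48 without intermediate rounding

z_table = 1.96  # two-tail critical value at 5% significance
reject_h0 = abs(z) > z_table
```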
2.486 > 1.96, so reject Ho.
Step 8: Conclusion: the stress levels are different for marketing and HR.
11-273 DIFFERENCE OF TWO MEANS – INDEPENDENT SAMPLES, SMALL SAMPLES
If the null hypothesis is accepted, it implies that both samples came from the same population, and for one population there can be only one mean and one variance. If the null hypothesis is rejected, it implies that the samples belong to different populations, each with its own mean. Even so, we assume that the two populations have the same variance; that is, homogeneity of variances is assumed.
When this assumption is made, the standard error can be recalled as follows. In the small-sample independent case, the standard error is calculated using the pooled estimate:
standard error = s(pooled) √(1/n1 + 1/n2)
where s²(pooled) = {(n1-1)s1² + (n2-1)s2²} / (n1 + n2 - 2)
and s(pooled) = √s²(pooled)
11-275 Worked example: independent small-sample 't' test
A car manufacturer intends to procure batteries for a given model from two different vendors. Before procuring, they wish to know whether the lives of the two batteries are similar. Samples of batteries from both manufacturers were selected randomly, and the life (in months) was found to be:
Brand A: 38, 37, 42, 44, 36, 39, 40, 41
Brand B: 42, 41, 37, 39, 40, 43, 44, 45, 46, 48, 39
Is there reason to believe that the battery lives of the two brands are different? Use a 5% level of significance.
Step 1: Ho: µ(a) = µ(b); Ha: µ(a) ≠ µ(b)
Step 2: The level of significance is given as 5%.
Step 3: This is a two-tail test, as the question only asks whether the lives of the two brands of batteries are different.
Step 4: Choosing the test statistic. Since the sample sizes are 8 and 11 respectively (small), a t statistic is used to test the hypothesis.
Step 5: Calculate the 't' statistic:
t = {(mean for Brand A) - (mean for Brand B) - 0} / standard error
mean life for Brand A = 39.625; mean life for Brand B = 42.18182
variance for Brand A = 7.125; variance for Brand B = 11.3636
pooled variance = 9.6183
t = {(39.625 - 42.18182) - 0} / {3.1013 √(1/8 + 1/11)} = -2.55682 / 1.44 = -1.7742
Absolute value = 1.7742.
Step 6: The tabulated value of 't' for 5% alpha at 17 df = 2.1098.
Step 7: Compare absolute values: t(cal) < t(tabulated), hence we are unable to reject the null hypothesis.
Step 8: Conclusion: the mean battery lives of the two brands are similar, and hence both vendors can be considered for selection based on other considerations such as price, delivery, etc.
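The pooled t calculation can be reproduced from the raw battery data (a stdlib-Python sketch; the critical value 2.1098 is the one quoted in the slides):

```python
from math import sqrt
from statistics import mean, variance  # variance() is the (n-1) sample variance

brand_a = [38, 37, 42, 44, 36, 39, 40, 41]
brand_b = [42, 41, 37, 39, 40, 43, 44, 45, 46, 48, 39]
n1, n2 = len(brand_a), len(brand_b)

# Pooled variance under the homogeneity-of-variance assumption.
s2_pooled = ((n1 - 1) * variance(brand_a) + (n2 - 1) * variance(brand_b)) / (n1 + n2 - 2)
se = sqrt(s2_pooled) * sqrt(1 / n1 + 1 / n2)

t = (mean(brand_a) - mean(brand_b)) / se  # ≈ -1.774

t_table = 2.1098  # two-tail critical value at 5%, 17 df (from the slides)
reject_h0 = abs(t) > t_table  # False: unable to reject Ho
```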
One of the assumptions made in solving this problem is that the population variances are equal, even if the alternative hypothesis is accepted. However, we have not checked this aspect, so it is necessary to do so, which we take up now.
11-282 CHECKING FOR HOMOGENEITY OF POPULATION VARIANCE
Homogeneity of population variance is checked by carrying out a hypothesis test as given below:
Step 1: Ho: σ1² = σ2²; Ha: σ1² ≠ σ2²
Step 2: Decide the level of significance: assume 5%.
Step 3: This is a two-tail test, based on the alternative hypothesis.
Step 4: Decide on the test statistic: the ratio of the two sample variances follows the 'F' test, also known as Fisher's test.
Step 5: Calculate the 'F' statistic = s1²/s2²; for the previous problem it is 7.125 / 11.3636 = 0.627.
Step 6: Read the table value of 'F' from the table. This requires the degrees of freedom for the numerator and the denominator, which are (n1-1) and (n2-1), i.e. 7 and 10 respectively.
How to read the F table: the upper-tail critical value is read directly, F(0.025) with 7, 10 df = 3.95. For the lower-tail critical value (the point with area 0.975 to its right), take the reciprocal of the F value at 0.025 with the degrees of freedom interchanged: F(0.025) with 10, 7 df = 4.76, so the lower value is 1/4.76 = 0.21.
Step 7: The calculated 'F' statistic (0.627) lies between the table values 0.21 and 3.95, so accept Ho.
Step 8: Hence homogeneity of variance is established.
11-288 DIFFERENCE OF TWO PROPORTIONS
A candidate who was interested in filing his nomination papers for an election wanted to understand whether he was equally popular in two adjacent constituencies or more popular in one of them. He availed the services of a research agency to check his popularity in the two constituencies. In this problem we cannot test the difference of two means, but rather the difference of two proportions: two independent samples are drawn from the two constituencies to find how many support his candidature. We take a worked example to explain this test.
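The homogeneity-of-variance check on the battery data (Steps 5-8 above) can be sketched as follows; the table values 3.95 and 1/4.76 are the ones quoted in the slides, and only the standard library is used:

```python
from statistics import variance  # (n-1) sample variance

brand_a = [38, 37, 42, 44, 36, 39, 40, 41]
brand_b = [42, 41, 37, 39, 40, 43, 44, 45, 46, 48, 39]

# F statistic: ratio of the two sample variances
# (numerator df = 7, denominator df = 10).
f_cal = variance(brand_a) / variance(brand_b)  # ≈ 0.627

# Two-tail critical values at 5% from the F table (quoted in the slides):
# upper tail F(0.025; 7, 10) = 3.95; lower tail = 1 / F(0.025; 10, 7) = 1/4.76.
f_lower, f_upper = 1 / 4.76, 3.95

homogeneous = f_lower < f_cal < f_upper  # True: fail to reject equal variances
```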
11-289 Worked example: Random samples were drawn from constituencies A and B and preference for his candidature was measured by a survey. In constituency A, the sample size was 800 and 390 favoured him; in constituency B, the sample size was 900 and 490 favoured him. Can we say that constituency B is more favourable to this candidate? Use a 5% level of significance.
Step 1: Ho: p(b) = p(a); Ha: p(b) > p(a)
Step 2: The alpha level is given as 5%.
Step 3: This is a single-tail test, based on the question.
Step 4: Choose the test statistic: the samples are large, so Z can be used.
Step 5: Calculate Z = [{p(b) - p(a)} - 0] / standard error for the difference of two proportions.
Recall that the standard error for the difference of two proportions is given by
√{p1(1-p1)/n1 + p2(1-p2)/n2}
In this problem p(a) = 390/800 = 0.4875 or 48.75%, and p(b) = 490/900 = 54.44%.
Hence (working in percentage points) the standard error = √[{48.75 × 51.25 / 800} + {54.44 × 45.56 / 900}] = 2.4246
So Z = {(54.44 - 48.75) - 0} / 2.4246 = 2.347
Step 6: Read the table value of Z at 5% alpha, single tail = 1.645.
Step 7: Compare Z(cal) with Z(table value): 2.347 > 1.645, so reject Ho.
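The two-proportion calculation can be sketched in Python; the slides work in percentage points, while this sketch works in plain proportions, which gives the same Z:

```python
from math import sqrt

# Constituency A: 390 of 800 favour the candidate; B: 490 of 900.
p_a, n_a = 390 / 800, 800
p_b, n_b = 490 / 900, 900

# Standard error for the difference of two proportions
# (unpooled, as in the slides), working in plain proportions.
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = (p_b - p_a) / se  # ≈ 2.35

z_table = 1.645  # one-tail critical value at 5%
reject_h0 = z > z_table
```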
Step 8: Conclusion: the candidate is more popular in constituency B than in A.
11-294 ANALYSIS OF VARIANCE
Consider the following research question. Three different types of seeds are sown in exactly similar soils and the same fertilizer is applied to each type of plant. The yields are as follows:
Yield (million tons)
Plot     Seed A   Seed B   Seed C
plot-1     8        12       9
plot-2     9        10      10
plot-3    10        10       9
plot-4     9        13       8
plot-5     9        11       8
If all the seed varieties are similar they should give similar yields; if some are superior, one or more types would give a larger yield.
Step 1: Ho: µa = µb = µc; Ha: at least two means are unequal.
A direct pairwise comparison is not feasible here: three separate comparisons would be needed (A with B, A with C, B with C), and with multiple comparisons the type I error would become very large. Hence we must adopt another method.
11-296 Basics of ANOVA: variance calculation
A) Calculate the correction factor: c/f = GT² / total sample size, where GT is the grand total.
B) Calculate the total sum of squares: TSS = (sum of each squared value) - c/f.
C) Calculate the sum of squares between samples:
SSB = (total for seed A)²/n(a) + (total for seed B)²/n(b) + (total for seed C)²/n(c) - c/f
For the problem above:
c/f = 145²/15 = 1401.66
TSS = 1431 - 1401.66 = 29.34
SSB = 1419.4 - 1401.66 = 17.74
SSW = TSS - SSB = 29.34 - 17.74 = 11.6
Now we can construct the ANOVA table.
11-298 ANOVA table (Step 2 onwards)
Source of variation   Sum of squares   df                      Mean square            F(cal)           F(table value)
Between treatments    SSB = 17.74      k-1 = 2                 MSB = SSB/df = 8.866   MSB/MSW = 9.178  6.93
Within treatments     SSW = 11.6       n(a)+n(b)+n(c)-k = 12   MSW = SSW/df = 0.966
Total                 TSS = 29.34      n(a)+n(b)+n(c)-1 = 14
(The table value 6.93 corresponds to the 1% level for 2 and 12 df.) The mean square is the variance: MSB is the variance between seeds and MSW is the variance within seeds.
11-300 ANOVA steps
Step 2: Decide the level of alpha.
Step 3: This is always a single-tail test.
Step 4: Since a ratio of variances is being considered, it is an 'F' test (Fisher's test).
Step 5: Calculate F(cal) as stated earlier.
Step 6: Read the F table value for the alpha level and the between and within degrees of freedom.
Step 7: Compare F(cal) with F(table value): if F(cal) ≥ F(table value), reject Ho.
Step 8: Conclusion.
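The sums of squares and F(cal) for the seed data can be reproduced with a short stdlib-Python sketch (values match the ANOVA table up to rounding):

```python
seed_a = [8, 9, 10, 9, 9]
seed_b = [12, 10, 10, 13, 11]
seed_c = [9, 10, 9, 8, 8]
groups = [seed_a, seed_b, seed_c]

all_values = [y for g in groups for y in g]
n = len(all_values)

cf = sum(all_values) ** 2 / n                          # correction factor
tss = sum(y * y for y in all_values) - cf              # total sum of squares
ssb = sum(sum(g) ** 2 / len(g) for g in groups) - cf   # between-treatment SS
ssw = tss - ssb                                        # within-treatment SS

df_between, df_within = len(groups) - 1, n - len(groups)
msb, msw = ssb / df_between, ssw / df_within
f_cal = msb / msw  # ≈ 9.17
```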
Hence F(cal) > F(table value): reject Ho.
Conclusion: the three types of seeds do not give the same yield.
11-302 SIGNIFICANCE TESTING FOR THE CORRELATION COEFFICIENT
Recall from the previous course on sampling methods that we calculated the correlation coefficient from sample information. The sample contained only a few values of the independent variable and the corresponding values of the dependent variable. If a correlation exists for these values, how can we be sure that a correlation will exist once all the values in the population are known? To answer this question, it is necessary to test the significance of the correlation coefficient. The procedure is as follows:
Step 1: Define the null and alternative hypotheses:
Ho: ρ = 0; Ha: ρ ≠ 0, where ρ is the population correlation.
Step 2: Decide on the level of alpha (type I error); say 5%.
Step 3: This is a two-tail test, based on the sign of the alternative hypothesis.
Step 4: Decide on the test statistic: since the sample is usually small, we use a 't' test.
Step 5: t = (r - ρ) / standard error of the correlation, where
standard error = √{(1 - r²)/(n - 2)}
and n is the number of (x, y) pairs of sample data.
Step 6: Read the table value of 't' for the significance level and (n-2) degrees of freedom.
Step 7: Compare t(cal) with t(table value): if t(cal) ≥ t(table value), reject Ho.
Step 8: Conclude whether the correlation exists in the population.
11-306 Worked example: test for correlation
Consider sample data collected on the number of hours of study done by students and the marks they obtained:
Student   Hours of study/day   Marks obtained (%)
a               12                    63
b               10                    68
c                8                    53
d                9                    60
e               15                    75
f               14                    80
g               11                    68
h               13                    53
Correlation coefficient 'r' (sample) = 0.613. You can refer to the lectures on sampling methods for details of how to calculate this value.
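As a cross-check, both r and the t statistic of Step 5 can be computed directly from the sample data (a stdlib-Python sketch):

```python
from math import sqrt
from statistics import mean

hours = [12, 10, 8, 9, 15, 14, 11, 13]
marks = [63, 68, 53, 60, 75, 80, 68, 53]
n = len(hours)

xbar, ybar = mean(hours), mean(marks)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(hours, marks))
sxx = sum((x - xbar) ** 2 for x in hours)
syy = sum((y - ybar) ** 2 for y in marks)

r = sxy / sqrt(sxx * syy)                   # sample correlation, ≈ 0.613
t = (r - 0) / sqrt((1 - r ** 2) / (n - 2))  # t statistic under Ho: rho = 0
```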
11-308 Worked example: test for correlation
Step 1: Ho: ρ = 0; Ha: ρ ≠ 0
Step 2: Assume an alpha level of 5%.
Step 3: This is a two-tail test.
Step 4: The test statistic is 't' in this case.
Step 5: Calculate t = (0.613 - 0) / √{(1 - 0.613²)/6} = 0.613 / 0.3225 = 1.90
Step 6: t(table value) at 5%, two-tail, df = 6, is 2.447.
Step 7: Compare t(cal) with t(table value): 1.90 < 2.447, so accept Ho.
Step 8: Conclusion: there is no significant correlation between the number of hours of study and the marks obtained.
This example clearly brings out that even though there is a non-zero correlation in the sample data, we cannot conclude that a correlation exists in the population. Had the correlation been larger, or had the sample size been larger for the same correlation coefficient, we might have concluded that a correlation exists in the population. This is important to understand.
11-310 TESTING FOR THE REGRESSION COEFFICIENT
Recall that we developed a regression equation to estimate the dependent variable (y) from the value of the independent variable (x):
y = a + bx
where y = dependent variable, x = independent variable, a = intercept, and b = regression coefficient (also known as the slope).
Just like the correlation, the regression coefficient is based on sample data only, and in order to use it to estimate the regression coefficient in the population we need to test its significance.
The population equation is Y = ß0 + ß1X, where ß1 is the regression coefficient in the population.
11-311 Testing for the regression coefficient
Step 1: Null and alternative hypotheses: Ho: ß1 = 0; Ha: ß1 ≠ 0
Step 2: Decide on the level of alpha.
Step 3: This is a two-tail test, based on the sign of Ha.
Step 4: Decide on the test statistic: usually 't', because the sample size is small.
Step 5: Calculate the 't' statistic = (b1 - ß1) / standard error(b1).
Step 6: Read the table value of 't' for the alpha level and df = n-2.
Step 7: Compare t(cal) with t(table value): if t(cal) ≥ t(table value), reject Ho.
Step 8: Conclusion.
11-312 Worked example for testing the regression coefficient
An ice-cream vendor wants to predict the sales of his product from the maximum temperature during the day. He collects the following data:
(y) sales (kgs): 223, 252, 230, 195, 185, 170, 272, 222, 215, 235
(x) temp (°C): 27, 30, 31, 28, 26, 23, 32, 29, 28, 30
The equation developed for this is y = a + bx, where a = -76.57 and b = 10.44 (you can use the formulas given in the subject on sampling methods).
11-313 Understanding the regression equation
We first need to understand the regression equation we have developed. Obtaining a regression equation only means that we have minimized the error in estimating 'y', not made it zero. Therefore, for the problem stated above, we can calculate the error made if we use the equation and the error made if we did not know the equation.
If we did not know the equation, we would have used the mean value of 'y' to estimate 'y', and the error would have been ∑{y - y(bar)}². If we know the equation, the error is ∑{y(actual) - y(est)}². Let us calculate both these values for the problem just given:
Total error = ∑{y - y(bar)}² = 8440.9
Error if the equation is known = ∑{y(actual) - y(est)}² = 1640.86
This means that an error of 8440.9 - 1640.86 = 6800.04 has been explained by the regression equation.
11-315 ANOVA for the regression
Source of error                         Sum of squares   df   Mean square   F(cal)   F(table value)
Due to regression                       6800.04          1    6800.04       33.15    11.26
Error still remaining (residual error)  1640.86          8    205.10
Total error                             8440.9           9
The residual mean square is the error variance.
11-316 The square root of the mean square error gives the standard error of the estimate (Se):
Se = √205.1 = 14.32
Standard error of b1 = Se / √{∑(x - x(bar))²}
∑(x - x(bar))² = 62.4, and √62.4 = 7.89
Standard error(b1) = 14.32 / 7.89 = 1.814
11-317 Worked example for testing the regression coefficient
Step 1: Ho: ß1 = 0; Ha: ß1 ≠ 0
Step 2: Assume an alpha level of 5%.
Step 3: This is a two-tail test.
Step 4: The test statistic is 't' in this case.
Step 5: Calculate t = (b - ß1) / standard error(b1) = (10.44 - 0) / 1.814 = 5.755
Step 6: Read the table value of 't' at 5% alpha, df = 8: 2.306.
Step 7: t(cal) > t(table value): reject Ho.
Step 8: Conclusion: the regression coefficient calculated from the sample is significant and can be used to find an interval estimate for the population parameter.
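The full regression test can be reproduced from the raw ice-cream data (a stdlib-Python sketch; slight differences from the slides' 5.755 come from their intermediate rounding):

```python
from math import sqrt
from statistics import mean

sales = [223, 252, 230, 195, 185, 170, 272, 222, 215, 235]  # y
temps = [27, 30, 31, 28, 26, 23, 32, 29, 28, 30]            # x
n = len(sales)

xbar, ybar = mean(temps), mean(sales)
sxx = sum((x - xbar) ** 2 for x in temps)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(temps, sales))

b = sxy / sxx        # slope (regression coefficient), ≈ 10.44
a = ybar - b * xbar  # intercept, ≈ -76.57

# Residual (error) sum of squares and the standard error of b.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(temps, sales))
se_est = sqrt(sse / (n - 2))  # standard error of the estimate, ≈ 14.32
se_b = se_est / sqrt(sxx)     # ≈ 1.81

t = (b - 0) / se_b  # t statistic under Ho: beta1 = 0
```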