Dr. Héctor Allende
Review of Probability and Statistics

A Review of Probability and Statistics
• Descriptive statistics
• Probability
• Random variables
• Sampling distributions
• Estimation and confidence intervals
• Tests of hypothesis
  – For means, variances, and proportions
  – Goodness of fit

Key Concepts
• Population -- "parameters"
  – Finite
  – Infinite
• Sample -- "statistics"
• Random samples - Your MOST important decision!

Data
• Deterministic vs. Probabilistic (Stochastic)
• Discrete or Continuous:
  – Whether a variable is continuous (measured) or discrete (counted) is a property of the data, not of the measuring device: weight is a continuous variable, even if your scale can only measure values to the pound.
• Data description:
  – Category frequency
  – Category relative frequency

Data Types
• Qualitative (Categorical)
  – Nominal -- IE = 1; EE = 2; CE = 3
  – Ordinal -- poor = 1; fair = 2; good = 3; excellent = 4
• Quantitative (Numerical)
  – Interval -- temperature, viscosity
  – Ratio -- weight, height
• The type of statistics you can calculate depends on the data type. Average, median, and variance make no sense if the data are categorical (proportions do).

Data Presentation for Qualitative Data
• Rules:
  – Each observation MUST fall in one and only one category.
  – All observations must be accounted for.
• Table -- Provides greater detail
• Bar graphs -- Consider Pareto presentation!
• Pie charts (do not need to be round)

Data Presentation for Quantitative Data
• Consider a Stem-and-Leaf Display
• Use 5 to 20 classes (intervals, groups).
  – Cell width, boundaries, limits, and midpoint
• Histograms
  – Discrete
  – Continuous (frequency polygon - plot at class mark)
• Cumulative frequency distribution (Ogive - plot at upper boundary)

Statistics
• Measures of Central Tendency
  – Arithmetic Mean
  – Median
  – Mode
  – Weighted mean
• Measures of Variation
  – Range
  – Variance
  – Standard Deviation
• Coefficient of Variation
• The Empirical Rule

Arithmetic Mean and Variance -- Raw Data
• Mean
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
• Variance
$$S^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = \frac{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}{n - 1}$$

Arithmetic Mean and Variance -- Grouped Data
• Mean
$$\bar{y} = \frac{\sum_{i} f_i y_i}{n}$$
• Variance
$$S^2 = \frac{\sum_{i} f_i (y_i - \bar{y})^2}{n - 1} = \frac{n \sum_{i} f_i y_i^2 - \left(\sum_{i} f_i y_i\right)^2}{n (n - 1)}$$
where $n = \sum_i f_i$ and $y_i$ = class midpoint

Percentiles and Box-Plots
• 100p-th percentile: value such that 100p% of the area under the relative frequency distribution lies below it.
  – Q1: lower quartile (25th percentile)
  – Q3: upper quartile (75th percentile)
• Box-Plots: limited by lower and upper quartiles
  – Whiskers mark lowest and highest values within 1.5*IQR from Q1 or Q3
  – Outliers: Beyond 1.5*IQR from Q1 or Q3 (mark with *)
• z-scores: deviation from the mean in units of standard deviation. Outlier: absolute value of z-score > 3

Probability: Basic Concepts
• Experiment: A process of OBSERVATION
• Simple event - An OUTCOME of an experiment that cannot be decomposed
  – "Mutually exclusive"
  – "Equally likely"
• Sample Space - The set of all possible outcomes
• Event "A" - The set of all possible simple events that result in the outcome "A"

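The definitions on the "Probability: Basic Concepts" slide can be made concrete with a small sketch. The two-dice experiment and the event "the faces sum to 7" are illustrative assumptions, not part of the original slides; the code simply enumerates the sample space and counts the simple events that make up the event.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered pair (first die, second die).
# Each pair is a simple event; they are mutually exclusive and equally likely.
sample_space = list(product(range(1, 7), repeat=2))

# Event A (hypothetical example): the set of simple events whose faces sum to 7.
event_a = [outcome for outcome in sample_space if sum(outcome) == 7]

# With equally likely simple events, Pr(A) is the share of the sample space in A.
pr_a = Fraction(len(event_a), len(sample_space))

print(len(sample_space), len(event_a), pr_a)
```

Using `Fraction` keeps the result exact, which makes it easy to compare against a hand count.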
Probability
• A measure of uncertainty of an estimate
  – The reliability of an inference
• Theoretical approach - "A Priori"
  – Pr(Ai) = n/N
    • n = number of possible ways "Ai" can be observed
    • N = total number of possible outcomes
• Historical (empirical) approach - "A Posteriori"
  – Pr(Ai) = n/N
    • n = number of times "Ai" was observed
    • N = total number of observations
• Subjective approach
  – An "Expert Opinion"

Probability Rules
$$0 \le \Pr(A_i) \le 1 \qquad \sum_i \Pr(A_i) = 1$$
• Multiplication Rule:
  – Number of ways to draw one element from set 1 which contains n1 elements, then an element from set 2, ..., and finally an element from set k (ORDER IS IMPORTANT!):
$$n_1 \cdot n_2 \cdots n_k$$

Permutations and Combinations
• Permutations:
  – Number of ways to draw r out of n elements WHEN ORDER IS IMPORTANT:
$$_nP_r = \frac{n!}{(n - r)!}$$
• Combinations:
  – Number of ways to select r out of n items when order is NOT important:
$$_nC_r = \frac{n!}{r!\,(n - r)!}$$

Compound Events
$$\text{Union: } A \cup B = \{x : x \in A \text{ or } x \in B \text{ or both}\}$$
$$\text{Intersection: } A \cap B = \{x : x \in A \text{ and } x \in B\}$$
$$\text{Complement: } A' = \{x : x \notin A\}$$

Conditional Probability
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{provided } P(B) \ne 0$$
Multiplicative Rule:
$$P(A \cap B) = P(A \mid B)\,P(B) \quad \text{provided } P(B) \ne 0$$

Other Probability Rules
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
• Mutually Exclusive Events: $A \cap B = \{\}$, so $P(A \cap B) = 0$
• Independence:
  – A and B are said to be statistically INDEPENDENT if and only if:
$$P(A \cap B) = P(A)\,P(B)$$

Bayes' Rule
$$P(A_i \mid E) = \frac{P(A_i)\,P(E \mid A_i)}{\sum_j P(A_j)\,P(E \mid A_j)}$$

Random Variables
• Random variable: A function that maps every possible outcome of an experiment into a numerical value.
• Discrete random variable: The function can assume a finite (or countably infinite) number of values.
• Continuous random variable: The function can assume any value between two limits.

Probability Distribution for a Discrete Random Variable
• Function that assigns a probability p(y) to each possible value of the random variable y.
$$0 \le p(y) \le 1 \qquad \sum_y p(y) = 1$$

Poisson Process
• Events occur over time (or in a given area, volume, weight, distance, ...)
• Probability of observing an event in a given unit of time is constant
• Able to define a unit of time small enough so that we can't observe two or more events simultaneously.
• Tables usually give CUMULATIVE values!

The Poisson Distribution
$$p(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$
• x is the number of events observed over T
• $\lambda$ is the expected number of events over T
• e is the base of natural logs (2.71828)
$$\mu = \sigma^2 = \lambda$$

Poisson Approximation to the Binomial
• In a binomial situation where n is very large (n > 25) and p is very small (p < 0.30, and np < 15), we can approximate b(y, n, p) by a Poisson with parameter $\lambda = np$:
$$b(y, n, p) = \binom{n}{y} p^y (1 - p)^{n - y} \approx P(y, np) = \frac{e^{-np} (np)^y}{y!}$$

Probability Distribution for a Continuous Random Variable
• $F(y_0)$ is the cumulative distribution function: it gives the probability of observing a value less than or equal to $y_0$.
$$F(y_0) = P(y \le y_0) = \int_{-\infty}^{y_0} f(y)\,dy$$
Property: $F(y)$ is continuous over $-\infty < y < \infty$

Probability Calculations
$$P(a \le y \le b) = \int_a^b f(y)\,dy$$
where $f(y)$ is the density function of y:
$$f(y) = \frac{d[F(y)]}{dy}$$
and $F(y)$ is the (probability) distribution function of y.
• $f(y) \ge 0$
• $\int_{-\infty}^{\infty} f(y)\,dy = 1$
• $F(y)$ is continuous
• $P(y = a) = 0$ for all continuous r.v. (a constant)

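As a rough numerical illustration of the "Probability Calculations" slide, the sketch below checks that integrating a density over [a, b] agrees with F(b) - F(a). The density f(y) = 2y on [0, 1], the limits a and b, and the helper names are assumptions made up for this example.

```python
def density(y):
    """Hypothetical density f(y) = 2y on [0, 1], zero elsewhere."""
    return 2.0 * y if 0.0 <= y <= 1.0 else 0.0


def cdf(y):
    """Matching distribution function F(y) = y^2 on [0, 1]."""
    if y < 0.0:
        return 0.0
    if y > 1.0:
        return 1.0
    return y * y


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width


a, b = 0.2, 0.5
by_integration = integrate(density, a, b)   # P(a <= y <= b) from the density
by_cdf = cdf(b) - cdf(a)                    # the same probability from F
total_area = integrate(density, 0.0, 1.0)   # a density must integrate to 1

print(by_integration, by_cdf, total_area)
```

Both routes give a probability of about 0.21 here, and the total area under the density is 1, as the slide requires.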
Expectations
$$\mu = E(y) = \sum_{\text{all } y} y\,p(y) \quad \text{(discrete)}$$
$$\mu = E(y) = \int_{-\infty}^{\infty} y\,f(y)\,dy \quad \text{(continuous)}$$
$$E[g(y)] = \int_{-\infty}^{\infty} g(y)\,f(y)\,dy$$
Variance: $\sigma^2 = E[(y - \mu)^2] = E(y^2) - \mu^2$
Standard deviation: $\sigma = \sqrt{\sigma^2}$
Properties of Expectations
• $E(c) = c$
• $E(cy) = c\,E(y)$
• $E[g_1(y) + g_2(y) + \cdots + g_k(y)] = E[g_1(y)] + E[g_2(y)] + \cdots + E[g_k(y)]$
• $\sigma^2(c) = 0$
• $\sigma^2(cy) = c^2 \sigma^2(y)$

The Uniform Distribution
$$\mu = \frac{a + b}{2} \qquad \sigma^2 = \frac{(b - a)^2}{12}$$
A frequently used model when no data are available.

The Triangular Distribution
A good model to use when no data are available. Just ask an expert to estimate the minimum, maximum, and most likely values.

The Normal Distribution
$$z = \frac{y - \mu}{\sigma} \quad \text{is the standard normal variable}$$
Tables provide cumulative values for the Standard Normal Distribution $N(\mu = 0, \sigma = 1)$

The Lognormal Distribution
Consider this model when 80 percent of the data values lie in the first 20% of the variable's range.

The Gamma Distribution
Properties (shape $\alpha$, scale $\beta$):
$$\mu = \alpha\beta \qquad \sigma^2 = \alpha\beta^2$$

The Erlang Distribution
A special case of the Gamma Distribution when $\alpha = k$ = integer.
A Poisson process where we are interested in the time to observe k events.

The Exponential Distribution
A special case of the Gamma Distribution when $\alpha = 1$.

The Weibull Distribution
A good model for failure time distributions of manufactured items. It has a closed expression for $F(y)$.

The Beta Distribution
A good model for proportions. You can fit almost any data. However, the data set MUST be bounded!

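Looking back at the "Expectations" and "Uniform Distribution" slides, the sketch below checks the uniform mean and variance numerically through E(y) and E(y^2) - mu^2. It assumes the standard uniform density 1/(b - a) on [a, b]; the limits a = 2 and b = 8 and the `integrate` helper are hypothetical choices for illustration.

```python
def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width


a, b = 2.0, 8.0  # hypothetical limits of the uniform distribution


def density(y):
    """Uniform density on [a, b] (assumed standard form 1 / (b - a))."""
    return 1.0 / (b - a)


# E(y) and E(y^2) computed as integrals of g(y) * f(y).
mean = integrate(lambda y: y * density(y), a, b)
second_moment = integrate(lambda y: y * y * density(y), a, b)

# Variance through the shortcut E(y^2) - mu^2.
variance = second_moment - mean ** 2

print(mean, (a + b) / 2)
print(variance, (b - a) ** 2 / 12)
```

The numerical values should agree with the closed forms (a + b)/2 and (b - a)^2/12 up to the integration error.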
Bivariate Data (Pairs of Random Variables)
• Covariance: measures strength of linear relationship
$$\mathrm{Cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\} = E(XY) - E(X)\,E(Y)$$
• Correlation: a standardized version of the covariance
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
• Autocorrelation: For a single time series: Relationship between an observation and those immediately preceding it. Does the current value ($X_t$) relate to itself lagged one period ($X_{t-1}$)?

Sampling Distributions
The population has PARAMETERS $\mu, \sigma$
A sample yields STATISTICS $\bar{X}, S^2$
A statistic is calculated based on the values observed in a sample. Those values are random variables. Therefore, a statistic is a RANDOM VARIABLE.
The sampling distribution of a statistic is its probability distribution.
The STANDARD ERROR of a statistic is the standard deviation of its sampling distribution.
See the slides "Arithmetic Mean and Variance -- Raw Data" and "Arithmetic Mean and Variance -- Grouped Data" for formulas to calculate sample means and variances (raw data and grouped data, respectively).

The Sampling Distribution of the Mean (Central Limit Theorem)
The CENTRAL LIMIT THEOREM: If random samples of size n are taken from a population having ANY distribution with mean $\mu$ and standard deviation $\sigma$, then, when n is large enough, the sampling distribution of the mean can be approximated by a normal density with mean and standard deviation
$$\mu_{\bar{Y}} = \mu \qquad \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}$$

The Sampling Distribution of Sums
Let $L = a_1 y_1 + a_2 y_2 + \cdots + a_k y_k$
Assume $E(y_i) = \mu_i$, $\mathrm{Var}(y_i) = \sigma_i^2$, $\mathrm{Cov}(y_i, y_j) = \sigma_{ij}$
Then L has mean and variance (and possesses a normal density when the $y_i$ are normally distributed):
$$E(L) = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_k \mu_k$$
$$\mathrm{Var}(L) = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \cdots + a_k^2 \sigma_k^2 + 2 a_1 a_2 \sigma_{12} + 2 a_1 a_3 \sigma_{13} + \cdots + 2 a_{k-1} a_k \sigma_{k-1,k}$$

Distributions Related to Variances
For a sample with standard deviation S, the statistic
$$\chi^2 = \frac{(n - 1) S^2}{\sigma^2}$$
follows a Chi-square distribution with $\nu = n - 1$.
For two independent samples, the statistic
$$F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2}$$
follows an F distribution with parameters $\nu_1$ in the numerator and $\nu_2$ in the denominator.
The sum of two independent chi-squares follows a chi-square distribution with $\nu = \nu_1 + \nu_2$.

The t Distribution
Let z be a standard normal variable and $\chi^2$ be a chi-square random variable with $\nu$ degrees of freedom. If z and $\chi^2$ are independent, then
$$t = \frac{z}{\sqrt{\chi^2 / \nu}}$$
is said to possess a Student's distribution ("t-distribution") with $\nu$ df.
COROLLARY: For a random sample taken from a normal population,
$$t = \frac{\bar{y} - \mu}{S / \sqrt{n}}$$
follows a t distribution with $\nu = n - 1$ df.

Estimation
• Point and Interval Estimators
• Properties of Point Estimators
  – Unbiased: E(estimator) = estimated parameter
    Note: $S^2$ is unbiased if $E(\bar{Y}) = \mu$
  – MVUE: Minimum Variance Unbiased Estimators
• Most frequently used method to estimate parameters: MLE - Maximum Likelihood Estimators.

Interval Estimators -- Large sample CI for mean
From the Central Limit Theorem:
$$\Pr\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha$$
After some algebraic manipulation we get:
$$\Pr\left(\bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
This is the $(1 - \alpha) \cdot 100\%$ Confidence Interval for $\mu$.

Interval Estimators -- Small sample CI for mean
For small samples (n < 30):
$$\Pr\left(-t_{\alpha/2} \le \frac{\bar{X} - \mu}{S / \sqrt{n}} \le t_{\alpha/2}\right) = 1 - \alpha$$
After some algebraic manipulation we get:
$$\Pr\left(\bar{X} - t_{\alpha/2} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2} \frac{S}{\sqrt{n}}\right) = 1 - \alpha$$
This is the $(1 - \alpha) \cdot 100\%$ Confidence Interval for $\mu$ (small samples), with $t_{\alpha/2}$ based on $n - 1$ df.

Sample Size
Based on CI for the mean, with H the desired half-width of the interval:
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{H}\right)^2 \approx \left(\frac{z_{\alpha/2}\,S}{H}\right)^2$$
Recommendation:
1. Sample approximately 30
2. Estimate $\sigma^2$ using $S^2$
3. Estimate n
4. Take more observations as needed.

CI for proportions (large samples)
The distribution of a proportion is approximately normal with mean = p and variance
$$\sigma^2 = \frac{p (1 - p)}{n}$$
Then, the C.I. for the population proportion is:
$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$$
where $\hat{p} = y/n$ is the observed proportion of successes.
Assumption: The interval does not contain 0 or 1.

Sample Size (proportions)
Based on CI for a proportion, with H the desired half-width of the interval:
$$n = \hat{p} (1 - \hat{p}) \left(\frac{z_{\alpha/2}}{H}\right)^2$$
Recommendation:
1. Sample approximately 30
2. Estimate $\hat{p}$
3. Estimate n
4. Take more observations as needed.

CI for the variance
The statistic:
$$\chi^2 = \frac{(n - 1) S^2}{\sigma^2} \sim \text{a Chi-Square distr. with } \nu = n - 1$$
After some algebraic manipulation:
$$\Pr\left(\frac{(n - 1) S^2}{\chi^2_{\alpha/2,\,\nu}} \le \sigma^2 \le \frac{(n - 1) S^2}{\chi^2_{(1 - \alpha/2),\,\nu}}\right) = 1 - \alpha$$
Assumption: Population is approximately normal.

CI for the Difference of Two Means -- large samples --
The difference of two means follows a normal density with:
$$E(\bar{Y}_1 - \bar{Y}_2) = \mu_1 - \mu_2 \quad \text{and} \quad \mathrm{Var}(\bar{Y}_1 - \bar{Y}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
C.I. for $\mu_1 - \mu_2$:
$$(\bar{Y}_1 - \bar{Y}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \approx (\bar{Y}_1 - \bar{Y}_2) \pm z_{\alpha/2} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$
Assumptions: Independent samples with more than 30 observations each.

CI for (p1 - p2) --- (large samples)
For large samples ($n_1$ and $n_2 \ge 30$):
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\,\sigma_{\hat{p}_1 - \hat{p}_2} \approx (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$
where $\hat{q}_i = 1 - \hat{p}_i$.
Approximation is good as long as neither interval includes 0 or 1.

CI for the Difference of Two Means -- small samples, same variance --
C.I. for $\mu_1 - \mu_2$:
$$(\bar{Y}_1 - \bar{Y}_2) \pm t_{\alpha/2,\,n_1 + n_2 - 2}\; S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
where
$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \quad \text{("pooled variance")}$$
Assumptions:
1. Independent samples taken from normal populations.
2. Variances are unknown but equal ($\sigma_1^2 = \sigma_2^2 = \sigma^2$)

CI for the Difference of Two Means -- small samples, different variances --
C.I. for $\mu_1 - \mu_2$:
$$(\bar{Y}_1 - \bar{Y}_2) \pm t_{\alpha/2,\,\nu} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$
and
$$\nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(S_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(S_2^2 / n_2\right)^2}{n_2 - 1}} \quad \text{(round down)}$$
Assumptions: Independent samples taken from normal populations.
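
The unequal-variance interval above can be sketched in a few lines. The summary statistics below are hypothetical, and the critical value is passed in by hand (read from a standard t table, as the slides suggest) rather than computed; the function names are illustrative only.

```python
import math


def welch_df(var1, n1, var2, n2):
    """Degrees of freedom for the unequal-variance case, rounded down."""
    a, b = var1 / n1, var2 / n2
    nu = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return math.floor(nu)


def welch_interval(mean1, var1, n1, mean2, var2, n2, t_crit):
    """Interval for mu1 - mu2 given a critical value t_{alpha/2, nu}."""
    half_width = t_crit * math.sqrt(var1 / n1 + var2 / n2)
    diff = mean1 - mean2
    return diff - half_width, diff + half_width


# Hypothetical summary data: sample means, sample variances, sample sizes.
df = welch_df(4.0, 10, 9.0, 15)

# t_{0.025, 22} taken from a standard t table (valid for the df computed above).
T_CRIT = 2.074
low, high = welch_interval(20.0, 4.0, 10, 18.0, 9.0, 15, T_CRIT)

print(df, round(low, 3), round(high, 3))
```

With these made-up numbers the interval contains 0, so the data would not show a difference between the two means at the 5% level.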

CI for the Difference of Two Means -- matched pairs --
We have PAIRS of observations related through some common factor $(Y_{1i}, Y_{2i})$:
Let $d_i = Y_{1i} - Y_{2i}$ be the observed difference for pair i.
C.I. for $\mu_d$:
$$\bar{d} \pm t_{\alpha/2,\,n - 1} \frac{S_d}{\sqrt{n}}$$
where $\bar{d}$ and $S_d$ are the mean and the standard deviation of the n sample differences.
Assumptions: Random observations; the population of paired differences is normally distributed.

CI for two variances
Recall:
$$F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2} = \frac{\frac{(n_1 - 1) S_1^2}{\sigma_1^2} \Big/ (n_1 - 1)}{\frac{(n_2 - 1) S_2^2}{\sigma_2^2} \Big/ (n_2 - 1)} = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1 - 1,\,n_2 - 1}$$
After some algebraic manipulation:
$$\Pr\left(\frac{S_1^2}{S_2^2} \cdot \frac{1}{F_{n_1 - 1,\,n_2 - 1,\,\alpha/2}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{S_1^2}{S_2^2} \cdot \frac{1}{F_{n_1 - 1,\,n_2 - 1,\,(1 - \alpha/2)}}\right) = 1 - \alpha$$
Assumption: Independent samples from normal populations.

Prediction Intervals
Consider the prediction of the value for the NEXT observation (not the mean value but its actual value), e.g., we want a "confidence interval" for $y_{n+1}$.
Consider the difference between this observation and the sample mean:
$$E(y_{n+1} - \bar{y}) = E(y_{n+1}) - E(\bar{y}) = \mu - \mu = 0$$
$$\sigma^2_{(y_{n+1} - \bar{y})} = \sigma^2(y_{n+1}) + \sigma^2(\bar{y}) = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2 \left(1 + \frac{1}{n}\right)$$
If the distribution of y is approximately normal, this difference will also be normal. This yields the following "prediction interval" for the next observation, $y_{n+1}$:
$$\Pr\left(\bar{y} - t_{\alpha/2,\,n - 1}\, S \sqrt{1 + \frac{1}{n}} \le y_{n+1} \le \bar{y} + t_{\alpha/2,\,n - 1}\, S \sqrt{1 + \frac{1}{n}}\right) = 1 - \alpha$$

Hypothesis Testing
• Elements of a Statistical Test. Focus on decisions made when comparing the observed sample to a claim (hypotheses). How do we decide whether the sample disagrees with the hypothesis?
• Null Hypothesis, $H_0$: A claim about one or more population parameters. What we want to REJECT.
• Alternative Hypothesis, $H_a$: What we test against. Provides criteria for rejection of $H_0$.
• Test Statistic: computed from sample data.
• Rejection (Critical) Region: indicates values of the test statistic for which we will reject $H_0$.

Errors in Decision Making
                  True State of Nature
Decision          H0 (Dishonest client)    Ha (Honest client)
Do not lend       Correct decision         Type II error
Lend              Type I error             Correct decision

Statistical Errors
Type I error ($\alpha$): Rejecting a true Null Hypothesis (producer's risk)
Type II error ($\beta$): Failing to reject a false Null Hypothesis, i.e., rejecting a true Alternative Hypothesis (consumer's risk)
Power of a statistical test, $(1 - \beta)$, is the probability of rejecting the null hypothesis $H_0$ when, in fact, $H_0$ is false.

Statistical Tests
One-tailed tests:
  $H_0: \mu = \mu_0$
  $H_a: \mu > \mu_0$ (or $\mu < \mu_0$)
  Rejection region: $z > z_\alpha$ (or $z < -z_\alpha$)
Two-tailed tests:
  $H_0: \mu = \mu_0$
  $H_a: \mu \ne \mu_0$
  Rejection region: $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$
where
$$z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \quad \text{and} \quad P(z > z_\alpha) = \alpha$$

The Critical Value
The sample size for specified $\alpha$ and $\beta$ when testing $H_0: \mu = \mu_0$ versus $H_a: \mu = \mu_a$ is given by
$$n = \frac{(z_\alpha + z_\beta)^2\,\sigma^2}{(\mu_a - \mu_0)^2}$$
Assumption: $\sigma$ is the same under both hypotheses.

The observed significance level for a test
It is standard in industry to use $\alpha = 0.05$.
Some researchers prefer to report the observed "p-value". This is the probability (under $H_0$) of observing a value of the test statistic at least as extreme as the one actually observed. This allows the reader to make his (her) own decision about accepting or rejecting $H_0$.
Most computer packages report the significance as (for example) Prob > |T|

Testing proportions (large samples)
$H_0: p = p_0$
Test statistic:
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}}$$
where $\hat{p} = y/n$ is the observed proportion of successes.
Rejection region (example): $z > z_\alpha$ ($H_a: p > p_0$)
Assumption: The interval $\hat{p} \pm 2 \sqrt{\hat{p} (1 - \hat{p}) / n}$ does not contain 0 or 1.

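A minimal sketch of the large-sample proportion test above, using only the Python standard library. The counts (60 successes in 100 trials), the null value p0 = 0.5, and the function name are assumptions chosen for illustration; the test is one-tailed with Ha: p > p0.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes, n, p0, alpha=0.05):
    """One-tailed large-sample test of H0: p = p0 against Ha: p > p0."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # critical value z_alpha
    p_value = 1 - NormalDist().cdf(z)           # observed significance level
    # Large-sample check: p_hat +/- 2 * sqrt(p_hat * (1 - p_hat) / n) inside (0, 1).
    margin = 2 * math.sqrt(p_hat * (1 - p_hat) / n)
    assumption_ok = 0 < p_hat - margin and p_hat + margin < 1
    return z, z_alpha, p_value, z > z_alpha, assumption_ok


# Hypothetical data: 60 successes out of 100 trials, testing p0 = 0.5.
z, z_alpha, p_value, reject, assumption_ok = proportion_z_test(60, 100, 0.5)

print(round(z, 3), round(z_alpha, 3), round(p_value, 4), reject, assumption_ok)
```

Reporting the p-value alongside the reject/do-not-reject decision follows the suggestion on the "observed significance level" slide.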
Testing a Normal Mean
• Select $\alpha$.
• Set your test as one-tailed or two-tailed.
• Calculate test statistic:
$$z = \frac{\bar{y} - \mu_0}{\sigma / \sqrt{n}} \approx \frac{\bar{y} - \mu_0}{S / \sqrt{n}}$$
• Compare to the critical value (from book's table).
If sample is small (n < 30):
• Calculate test statistic:
$$t = \frac{\bar{y} - \mu_0}{S / \sqrt{n}}$$
(assumes an approximately normal population)

Testing a variance
$H_0: \sigma^2 = \sigma_0^2$
Test statistic:
$$\chi^2 = \frac{(n - 1) S^2}{\sigma_0^2}$$
Rejection region:
• $\chi^2 > \chi^2_\alpha$ for $H_a: \sigma^2 > \sigma_0^2$
• $\chi^2 < \chi^2_{1 - \alpha}$ for $H_a: \sigma^2 < \sigma_0^2$
• $\chi^2 < \chi^2_{1 - \alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$ for $H_a: \sigma^2 \ne \sigma_0^2$
Assumption: Population is approximately normal.

Testing Differences of Two Means -- large samples --
$H_0: \mu_1 - \mu_2 = D_0$
Test statistic:
$$z = \frac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$
Rejection region:
• $z > z_\alpha$ if $H_a: \mu_1 - \mu_2 > D_0$
• $z < -z_\alpha$ if $H_a: \mu_1 - \mu_2 < D_0$
• $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$ if $H_a: \mu_1 - \mu_2 \ne D_0$
Assumptions: Independent samples with more than 30 observations each.

Testing Differences of Two Means -- small samples, same variance --
$H_0: \mu_1 - \mu_2 = D_0$
Test statistic:
$$t = \frac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Rejection region (example): $t > t_{\alpha,\,n_1 + n_2 - 2}$ ($H_a: \mu_1 - \mu_2 > D_0$)
where
$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \quad \text{("pooled variance")}$$
Assumptions:
1. Independent samples from normal populations.
2. Variances are unknown but equal ($\sigma_1^2 = \sigma_2^2 = \sigma^2$)

Testing Differences of Two Means -- small samples, different variances --
$H_0: \mu_1 - \mu_2 = D_0$
Test statistic:
$$t = \frac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$
Rejection region (example): $t > t_{\alpha,\,\nu}$ ($H_a: \mu_1 - \mu_2 > D_0$)
where
$$\nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(S_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(S_2^2 / n_2\right)^2}{n_2 - 1}} \quad \text{(round down)}$$
Assumptions: Independent samples taken from approximately normal populations.

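For the equal-variance case ("small samples, same variance"), the pooled test statistic can be sketched as follows. The summary statistics are hypothetical, the helper name is illustrative, and the critical value is taken by hand from a standard t table rather than computed.

```python
import math


def pooled_t(mean1, var1, n1, mean2, var2, n2, d0=0.0):
    """Pooled-variance t statistic for H0: mu1 - mu2 = d0."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_error = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    t = ((mean1 - mean2) - d0) / std_error
    return pooled_var, t, df


# Hypothetical summary data: sample means, sample variances, sample sizes.
pooled_var, t_stat, df = pooled_t(20.0, 4.0, 10, 18.0, 5.0, 12)

# t_{0.05, 20} from a standard t table; one-tailed test with Ha: mu1 - mu2 > 0.
T_CRIT = 1.725
reject = t_stat > T_CRIT

print(round(pooled_var, 3), round(t_stat, 3), df, reject)
```

The same function covers a nonzero hypothesized difference by passing `d0`.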
Testing Difference of Two Means -- matched pairs --
We have PAIRS of observations related through some common factor $(Y_{1i}, Y_{2i})$:
Let $d_i = Y_{1i} - Y_{2i}$ be the observed difference for pair i.
$H_0: \mu_{diff} = \mu_1 - \mu_2 = D_0$
Test statistic:
$$t = \frac{\bar{d} - D_0}{S_d / \sqrt{n}}$$
Rejection region: $t > t_{\alpha,\,n - 1}$ for $H_a: \mu_{diff} = \mu_1 - \mu_2 > D_0$
where $\bar{d}$ and $S_d$ are the mean and the standard deviation of the n sample differences.
Assumptions: Random observations; the population of paired differences is normally distributed.

Testing a ratio of two variances
$H_0: \sigma_1^2 / \sigma_2^2 = 1$ (e.g., $\sigma_1^2 = \sigma_2^2$)
Test statistic:
$$F = \frac{\text{larger sample variance}}{\text{smaller sample variance}}$$
Rejection region:
• $F > F_\alpha$ for $H_a: \sigma_1^2 > \sigma_2^2$
• $F > F_{\alpha/2}$ for $H_a: \sigma_1^2 \ne \sigma_2^2$
Assumption: Independent samples from normal populations.
Note: Make sure the df in the numerator are those of the sample with larger variance!

Testing (p1 - p2) --- (large samples)
For large samples ($n_1$ and $n_2 \ge 30$):
$H_0: p_1 - p_2 = D_0$
Test statistic:
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sigma_{\hat{p}_1 - \hat{p}_2}}$$
• when $D_0 \ne 0$:
$$\sigma_{\hat{p}_1 - \hat{p}_2} \approx \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$
• when $D_0 = 0$:
$$\sigma_{\hat{p}_1 - \hat{p}_2} \approx \sqrt{\hat{p}\,\hat{q} \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \quad \text{and} \quad \hat{p} = \frac{y_1 + y_2}{n_1 + n_2}$$
Approximation is good as long as no interval includes 0 or 1.

Categorical Data
One-way Table: Categories and their frequencies:
Categ.   1    2    ..   k    Total
Freq.    n1   n2   ..   nk   n
Large sample conf. int. for $p_i$:
$$\hat{p}_i \pm z_{\alpha/2} \sqrt{\frac{1}{n}\,\hat{p}_i (1 - \hat{p}_i)}$$
Example:
EE   ME   Others   Total
17   11   9        37
Then
$$p_{EE}: \quad \frac{17}{37} \pm 1.96 \sqrt{\frac{1}{37} \cdot \frac{17}{37} \cdot \frac{20}{37}} = 0.46 \pm 0.16$$
$$0.30 \le p_{EE} \le 0.62$$

One-way Tables (Cont.)
Large sample $(1 - \alpha) \cdot 100\%$ Conf. Int. for $p_i - p_j$:
$$(\hat{p}_i - \hat{p}_j) \pm z_{\alpha/2} \sqrt{\frac{1}{n} \left[\hat{p}_i (1 - \hat{p}_i) + \hat{p}_j (1 - \hat{p}_j) + 2 \hat{p}_i \hat{p}_j\right]}$$
In the example:
$$p_{EE} - p_{ME}: \quad \frac{17}{37} - \frac{11}{37} \pm 1.96 \sqrt{\frac{1}{37} \left[\frac{17}{37} \cdot \frac{20}{37} + \frac{11}{37} \cdot \frac{26}{37} + 2 \cdot \frac{17}{37} \cdot \frac{11}{37}\right]} = 0.162 \pm 0.275$$
$$-0.113 \le p_{EE} - p_{ME} \le 0.437 \quad \text{NOT significant!}$$
NOTE: $-0.045 \le p_{EE} - p_{Others} \le 0.477$; again, the difference is NOT significant!

Categorical Data Analysis
General r x c Contingency Table
          1         2         ..   c         Totals
1         n(1,1)    n(1,2)    ..   n(1,c)    r(1)
2         n(2,1)    n(2,2)    ..   n(2,c)    r(2)
..        ..        ..        ..   ..        ..
r         n(r,1)    n(r,2)    ..   n(r,c)    r(r)
Totals    c(1)      c(2)      ..   c(c)      n

Example of a Contingency Table
STA 3032 - Summer 1994
Grade     Q2    Q4    Q6    Total
0-2       13    0     2     15
2.1-4     6     1     1     8
4.1-6     8     5     11    24
6.1-8     4     7     9     20
8.1-10    2     16    6     24
Total     33    29    29    91

Testing for Independence
$H_0$: Variables are independent
$H_a$: They are not
Test statistic:
$$\chi^2 = \sum_{j=1}^{c} \sum_{i=1}^{r} \frac{\left[n_{ij} - E(n_{ij})\right]^2}{E(n_{ij})} = n \left[\sum_{j=1}^{c} \sum_{i=1}^{r} \frac{n_{ij}^2}{r_i c_j} - 1\right] \quad \text{where } E(n_{ij}) = \frac{r_i c_j}{n}$$
Rejection region: $\chi^2 > \chi^2_{0.05,\,(r - 1)(c - 1)}$
Note: regroup rows (columns) as needed so that $E(n_{ij}) \ge 5$ for all i, j.
In the example:
$$\chi^2 = 91 \left[\frac{19^2}{23 \cdot 33} + \frac{1^2}{23 \cdot 29} + \cdots + \frac{6^2}{24 \cdot 29} - 1\right] = 41.33$$
Note regrouping! (The first two grade rows are combined.)
Compare to $\chi^2_{0.05,\,6} = 12.5916$ (from Table)
Conclusion: Variables are NOT independent.

Distributions: Model Fitting Steps
1. Collect data. Make sure you have a random sample. You will need at least 30 valid cases.
2. Plot data. Look for familiar patterns.
3. Hypothesize several models for distribution.
4. Using part of the data, estimate model parameters.
5. Using the rest of the data, analyze the model's accuracy.
6. Select the "best" model and implement it.
7. Keep track of model accuracy over time. If warranted, go back to 6 (or to 3, if data (population?) behavior keeps changing).

Chi-Square Test of Goodness of Fit
$H_0: p_1 = p_{10};\ p_2 = p_{20};\ \ldots;\ p_k = p_{k0}$ with $\sum_i p_i = \sum_i p_{i0} = 1$
$H_a$: At least one $p_i \ne p_{i0}$
Let n = sample size and $\hat{p}_i = n_i / n$ the observed relative frequency in cell i.
Make sure that $e_i = n p_{i0} \ge 5$ for all i (if not, regroup cells as needed).
Test Statistic:
$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^2}{n p_{i0}}$$
Rejection Region: $\chi^2 > \chi^2_{\alpha,\,\nu}$
where:
  $\nu = k - r - 1$
  k = number of cells after regrouping
  r = number of parameters estimated from data to calculate $p_{i0}$

Kolmogorov-Smirnov Test of Goodness of Fit
Compares the empirical distribution function $F_n(y)$ with a hypothesized theoretical distribution function $F(y)$.
Empirical: $F_n(y)$ = fraction of the sample less than or equal to y
$$F_n(y_{(i)}) = \frac{i}{n} \quad \text{for the } i\text{th ranked observation } y_{(i)}$$
Let
$$D^+ = \max_i \left[\frac{i}{n} - F(y_{(i)})\right] \qquad D^- = \max_i \left[F(y_{(i)}) - \frac{i - 1}{n}\right]$$
Then
$$D = \max \left|F(y) - F_n(y)\right| = \max(D^+, D^-)$$
Critical values given in tables.

A Review of Probability and Statistics
• Descriptive statistics
• Probability
• Random variables
• Sampling distributions
• Estimation and confidence intervals
• Tests of hypothesis
  – For means, variances, and proportions
  – Goodness of fit
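
As a cross-check on the "Testing for Independence" slide, the sketch below recomputes the chi-square statistic for the STA 3032 example after combining the first two grade rows, as the slide's note on regrouping indicates. The variable names are illustrative, and the critical value 12.5916 is the table value quoted in the slides.

```python
# Observed counts from the example table, columns Q2, Q4, Q6.
# The rows "0-2" and "2.1-4" are combined so every expected count is at least 5.
observed = [
    [19, 1, 3],    # grades 0-4 (13+6, 0+1, 2+1)
    [8, 5, 11],    # grades 4.1-6
    [4, 7, 9],     # grades 6.1-8
    [2, 16, 6],    # grades 8.1-10
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Definitional form: sum of (observed - expected)^2 / expected.
chi_square = 0.0
min_expected = float("inf")
for i, row in enumerate(observed):
    for j, n_ij in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        min_expected = min(min_expected, expected)
        chi_square += (n_ij - expected) ** 2 / expected

# Shortcut form from the slide: n * (sum of n_ij^2 / (r_i * c_j) - 1).
shortcut = n * (
    sum(
        n_ij ** 2 / (row_totals[i] * col_totals[j])
        for i, row in enumerate(observed)
        for j, n_ij in enumerate(row)
    )
    - 1
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_VALUE = 12.5916  # chi-square table value for alpha = 0.05, 6 df

print(round(chi_square, 2), round(shortcut, 2), df, chi_square > CRITICAL_VALUE)
```

Both forms of the statistic agree, and the value exceeds the critical value, matching the slide's conclusion that the variables are not independent.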