
Stats 156 – Terms

Chapter 1

Population = the entire collection of individuals or objects about which information is desired
Sample = a subset of the population selected for study in some way
Categorical Data = data that is qualitative in nature
Numerical Data = data that is quantitative in nature
Continuous = the possible values form an interval on the number line
Discrete = the possible values are separated points on the number line

Chapter 2

Bias = the tendency for samples to differ from the corresponding population
Selection Bias = some part of the population is systematically excluded from the sample
Measurement Bias = the method of observation produces values that differ from the true population values
Non-Response Bias = responses are not obtained from all members of the sample
** Selection Bias is the most common
Extraneous Factor = a factor that is not of interest in the current study but is thought to affect the response variable in some way
Confounded Factors = two factors whose effects on the response variable cannot be distinguished from one another

Chapter 3

Relative Frequency = the fraction or proportion of the time that a particular category appears in the data set
Frequency Distribution = a table that displays the possible categories along with the associated frequencies, relative frequencies, cumulative frequencies, and cumulative relative frequencies
Outlier = an unusually small or large data value
Dot Plot, Stem & Leaf, Histogram

Chapter 4

Mean = average of the observation values
Median = the middle value when the data is arranged in order
Mode = the observation value occurring most often
Standard Deviation = a “typical” deviation from the mean
Lower Quartile (Q1) = the median of the lower half of the data
Upper Quartile (Q3) = the median of the upper half of the data
Interquartile Range (IQR) = upper quartile – lower quartile
Outlier = a sample observation that lies more than 1.5(IQR) below the lower quartile or above the upper quartile, or an observation that has |z-score| > 2
Chebyshev’s Rule = the percentage of observations that are within $k$ standard deviations of the mean is at least $100\left(1 - \frac{1}{k^2}\right)\%$ (for $k > 1$)
So at least 75% are within 2 standard deviations
At least 88.9% are within 3 standard deviations
At least 93.75% are within 4 standard deviations
…
Empirical Rule = if the data is well represented by a normal curve, then 68% of the observations are within 1 standard deviation of the mean, 95% are within 2, and 99.7% are within 3
Z-Score = how many standard deviations an observation is from the mean
5 Number Summary of Data Set = min, Q1, median, Q3, max

Chapter 5

Residual = difference between the observed value and the regression model value
(Pearson’s) Correlation Coefficient = measures how closely the data fall on a straight line
Coefficient of Determination = tells what percent of the variation can be explained by the linear relationship between x and y

Chapters 6 & 7

Chance Experiment = any activity or situation in which there is uncertainty about which of two or more possible outcomes will result
Probability of Event E (P(E)) = the ratio of the number of possible ways to get E to the total number of possible outcomes (when all outcomes are equally likely)
Properties of Probability
1. For any event E, $0 \le P(E) \le 1$
2. If S is the entire sample space, $P(S) = 1$
3. If events E and F are disjoint, then $P(E \cup F) = P(E) + P(F)$; otherwise $P(E \cup F) = P(E) + P(F) - P(E \cap F)$
4. For any event E, $P(E) = 1 - P(\bar{E})$, where $\bar{E}$ is the event "not E"
Random Variable = a numerical variable whose value depends on the outcome of a chance experiment
Discrete = possible outcomes are isolated points on a number line
Continuous = possible outcomes are an entire interval on the number line
Probability Distribution (discrete x) = the probability of each possible outcome of the random variable x
Probability Distribution (continuous x) = a function $f(x)$ (sometimes called the density function of the variable x) with the properties that $f(x) \ge 0$ for all x and $\int_{-\infty}^{\infty} f(x)\,dx = 1$. In this case, $P(a \le x \le b) = \int_a^b f(x)\,dx$
Normal Distribution = any probability distribution that has a bell-shaped curve as its density function
Standard Normal Distribution = the normal distribution (bell-shaped curve) with mean $\mu = 0$ and standard deviation $\sigma = 1$
*Calculator can find probabilities for normal distributions with mean $\mu$ and standard deviation $\sigma$

Chapter 8

Statistic = any value computed from values in a sample
Sampling Distribution = the probability distribution for a statistic
$\bar{x}$ Sampling = for samples of size n in a population, we find $\bar{x}$ for each sample, and the variable $\bar{x}$ becomes the random variable of interest
General Properties of an $\bar{x}$ Sampling Distribution
Let $\bar{x}$ denote the mean of the observations in a random sample of size n from a population having mean $\mu$ and standard deviation $\sigma$.
1. $\mu_{\bar{x}} = \mu$
2. $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$, so $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
3. When the population is normal, the sampling distribution of $\bar{x}$ is also normal for any size n
4. For large values of n ($n \ge 30$), the sampling distribution of $\bar{x}$ is approximately normal regardless of whether or not the population itself is normal (this is the Central Limit Theorem)
Proportion Sampling
S = Success = an individual or object has a specific property that is under investigation
F = Failure = does not have the property
$\pi$ = the proportion of successes in the entire population
General Properties of a Proportion Sampling Distribution
Let p be the proportion of successes in a random sample of size n from a population whose proportion of S’s is $\pi$. Denote the mean value of p by $\mu_p$ and the standard deviation of p by $\sigma_p$. Then:
1. $\mu_p = \pi$
2. $\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$
3. When n is large and $\pi$ is not too close to 0 or 1, the sampling distribution of p is approximately normal. A conservative rule says that if $n\pi \ge 10$ and $n(1 - \pi) \ge 10$, then it is safe to use a normal approximation.

Chapter 9

Point Estimate = a single number that is based on sample data and represents a plausible value of the characteristic for the entire population
Unbiased Statistic = a statistic whose sampling distribution has a mean equal to the value of the population characteristic being estimated (otherwise it is called biased)
$\bar{x}$ is an unbiased point estimate of $\mu$
p is an unbiased point estimate of $\pi$
$s^2$ is an unbiased estimate of $\sigma^2$
s is a biased estimate of $\sigma$
When we have several unbiased estimates to choose from, we always want the one having minimum variance (called the minimum variance estimate).
Confidence Interval = an interval of plausible values for the characteristic being studied; it is constructed so that, with a chosen degree of confidence, the value of the characteristic will be captured inside the interval
Confidence Level = the success rate of the method used to construct a confidence interval
Large Sample Confidence Interval for $\pi$
1. Let p be the sample proportion and
2. Let n be the sample size
The confidence interval for $\pi$ is $p \pm z^* \sqrt{\frac{p(1 - p)}{n}}$
where $z^*$ is the z-value corresponding to the desired confidence level:
80% → 1.282
90% → 1.645
95% → 1.96
98% → 2.326
99% → 2.576
z Confidence Interval for $\mu$
Used when the population standard deviation is known.
The confidence interval for $\mu$ is $\bar{x} \pm z^* \left(\frac{\sigma}{\sqrt{n}}\right)$, where $z^*$ is as above
t Distributions
1. The t curve corresponding to any number of degrees of freedom is bell shaped and centered at 0
2. Each t curve is more spread out than the z curve
3. As the number of degrees of freedom increases, the spread of the corresponding t curves decreases
4. As the number of degrees of freedom increases, the corresponding sequence of t curves becomes closer and closer to the z curve
t Confidence Interval for $\mu$
Used when the population standard deviation is unknown. We instead use the sample standard deviation and a t distribution rather than the normal distribution.
The confidence interval for $\mu$ is $\bar{x} \pm t^* \left(\frac{s}{\sqrt{n}}\right)$, where $t^*$ comes from the t curve with n − 1 df

Chapter 10 – 1 Variable Hypothesis Tests

Null Hypothesis (H0) = a claim about a population characteristic that is initially assumed to be true
H0: population characteristic = hypothesized value
Alternative Hypothesis (Ha) = a statement that the null hypothesis is not true in some way
Ha: population characteristic > hypothesized value
Ha: population characteristic < hypothesized value
Ha: population characteristic ≠ hypothesized value
Type I Error = rejecting H0 when it is in fact true
Type II Error = failing to reject H0 when it is in fact not true
Level of Significance = the probability of a type I error (denoted $\alpha$)
Note: the probability of a type II error is denoted $\beta$
For a fixed sample size, as $\alpha$ gets smaller, $\beta$ gets larger. We will focus on controlling the value of $\alpha$.
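As a recap of the Chapter 9 interval formulas, here is a minimal Python sketch. The function names and the sample numbers are made up for illustration; the $z^*$ values are the ones listed in the notes. It covers only the two z-based intervals, since a $t^*$ value would need a t table or a statistics library.

```python
import math

# z* critical values copied from the table in these notes.
Z_STAR = {80: 1.282, 90: 1.645, 95: 1.96, 98: 2.326, 99: 2.576}


def proportion_ci(p_hat, n, level=95):
    """Large-sample interval for a proportion: p +/- z* sqrt(p(1 - p)/n)."""
    margin = Z_STAR[level] * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def mean_z_ci(x_bar, sigma, n, level=95):
    """z interval for a mean (sigma known): x_bar +/- z* sigma/sqrt(n)."""
    margin = Z_STAR[level] * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin


# Hypothetical data: 60 successes out of 100, and a sample mean of 50
# from n = 25 observations with known sigma = 10.
p_low, p_high = proportion_ci(p_hat=0.6, n=100)
m_low, m_high = mean_z_ci(x_bar=50, sigma=10, n=25)
print(round(p_low, 3), round(p_high, 3))
print(round(m_low, 2), round(m_high, 2))
```

Raising the confidence level only changes $z^*$, so the interval widens while the center stays at the point estimate.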
Test Statistic = the function of sample data on which a conclusion to reject or fail to reject H0 is based
For population proportions, the test statistic is p = sample proportion
For population means, the test statistic is $\bar{x}$ = sample mean
P-Value = the probability of obtaining the observed value of the test statistic (or a more extreme value), assuming H0 is true
The smaller the P-value, the stronger the evidence to reject H0 (we reject H0 if P-value ≤ $\alpha$)
The larger the P-value, the weaker the evidence against H0 (we fail to reject H0 if P-value > $\alpha$)
We compare the P-value to the level of the test ($\alpha$) to determine whether to reject H0 or not

P-Value for a Population Proportion Hypothesis Test
If we have that $np \ge 10$ and $n(1 - p) \ge 10$:
Given H0: $\pi = \pi_0$, a sample proportion p, and $\sigma_p = \sqrt{\frac{\pi_0(1 - \pi_0)}{n}}$
1. Ha: $\pi > \pi_0$: normalcdf($p$, $10^{10}$, $\pi_0$, $\sigma_p$)
2. Ha: $\pi < \pi_0$: normalcdf($-10^{10}$, $p$, $\pi_0$, $\sigma_p$)
3. Ha: $\pi \ne \pi_0$: 2*normalcdf($-10^{10}$, $p$, $\pi_0$, $\sigma_p$) if $p < \pi_0$, or 2*normalcdf($p$, $10^{10}$, $\pi_0$, $\sigma_p$) if $p > \pi_0$
Note: This can be done in the TI-83 using the 1-PropZTest

5 Pieces of a Hypothesis Test
1. H0
2. Ha
3. Test statistic (p = # or $\bar{x}$ = #)
4. P-Value
5. Conclusion

One-Sample t Test (or z Test) for a Population Mean
If we have that $n \ge 30$:
Given H0: $\mu = \mu_0$, a sample mean $\bar{x}$, and $\sigma_{\bar{x}} = \frac{s}{\sqrt{n}}$, with $t^* = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}}$
1. Ha: $\mu > \mu_0$: tcdf($t^*$, $10^{10}$, df)
2. Ha: $\mu < \mu_0$: tcdf($-10^{10}$, $t^*$, df)
3. Ha: $\mu \ne \mu_0$: 2*tcdf($-10^{10}$, $t^*$, df) if $\bar{x} < \mu_0$, or 2*tcdf($t^*$, $10^{10}$, df) if $\bar{x} > \mu_0$
Each case uses the t curve with n − 1 df
** In Calculator = ZTest or TTest

Chapter 11 – 2 Variable Hypothesis Tests

Independent Samples = two samples in which the selection of one sample in no way affects the selection of the other sample
Paired Samples = observations from the first sample are in some meaningful way paired with observations from the second sample
Sampling Distribution for $\bar{x}_1 - \bar{x}_2$
Suppose two independent quantities, A and B, are normal.
Then A – B is normal with mean $\mu_{A-B} = \mu_A - \mu_B$ and standard deviation $\sigma_{A-B} = \sqrt{\sigma_A^2 + \sigma_B^2}$
So the distribution of $\bar{x}_1 - \bar{x}_2$ is normal with mean $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$ and standard deviation $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

Two Sample t Test for 2 Population Means
Assuming $n_1$ and $n_2$ are both ≥ 30 (or both populations are normal) and the samples were selected independently:
H0: $\mu_1 - \mu_2 = 0$
Ha: One of:
$\mu_1 - \mu_2 > 0$
$\mu_1 - \mu_2 < 0$
$\mu_1 - \mu_2 \ne 0$
Test Statistic: t as shown below
P-Value: One of:
Area to the right of computed t under the t curve
Area to the left of computed t under the t curve
Sum of areas to the right of |computed t| and to the left of –|computed t|
Where $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, with $\mu_1 - \mu_2$ the hypothesized difference (0 under H0),
and df = $\frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$, rounded down
** In Calculator = 2-SampZTest or 2-SampTTest

Paired t Test for Comparing 2 Population Means
Assuming samples are paired, the n sample differences can be viewed as a random sample from a population of differences, and n ≥ 30 or the population of differences is normal:
H0: $\mu_d = \mu_1 - \mu_2$ = hypothesized value
Ha: One of:
$\mu_d$ > hypothesized value
$\mu_d$ < hypothesized value
$\mu_d$ ≠ hypothesized value
Test Statistic: t as shown below
P-Value: One of:
Area to the right of calculated t under the t curve with n – 1 df
Area to the left of calculated t under the t curve with n – 1 df
Sum of areas to the right of |t| and to the left of –|t|
Where $t = \frac{\bar{x}_d - \text{hypothesized value}}{s_d / \sqrt{n}}$ and df = n – 1
** In Calculator = ZTest or TTest (since we are back to one random variable)

Large Sample z Test for 2 Population Proportions
Assuming independent samples, $n_1 p_1 \ge 10$, $n_1(1 - p_1) \ge 10$, $n_2 p_2 \ge 10$, and $n_2(1 - p_2) \ge 10$:
H0: $\pi_1 - \pi_2 = 0$
Ha: One of:
$\pi_1 - \pi_2 > 0$
$\pi_1 - \pi_2 < 0$
$\pi_1 - \pi_2 \ne 0$
Test Statistic: $z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}}$ where $p_c = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
P-Value: Upper, lower, or two tailed area under the z curve (just as in the prior proportion tests)
** In Calculator = 2-PropZTest

Distribution Free = procedures that do not require any overly specific assumptions about the population distributions

Rank Sum Test
Assuming: the samples are randomly collected or the two treatments are randomly assigned to individuals, and the two population distributions have the same shape and spread (but are not necessarily normal)
H0: $\mu_1 - \mu_2 = 0$
Ha: One of:
$\mu_1 - \mu_2 > 0$ (upper tail test)
$\mu_1 - \mu_2 < 0$ (lower tail test)
$\mu_1 - \mu_2 \ne 0$ (two tailed test)
Test Statistic: Rank sum = sum of the ranks assigned to the observations in the first sample
P-Value: Found from table on page 817 of Peck, Olsen, Devore
**Rank:
1. List all observations (from both samples) from smallest to largest.
2. Rank them: smallest = 1, next smallest = 2, …
Ties: rank each as the average of the positions in the list, e.g. if 50 was both the 4th and 5th observation, each 50 would have rank 4.5

Chapter 12 – X2 Hypothesis Tests

Goodness of Fit Statistic = $X^2$ = a quantitative measure of the extent to which the observed counts differ from those expected when the null hypothesis is true
$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
If all expected counts are ≥ 5, then $X^2$ has approximately a $\chi^2$ (chi-squared) probability distribution.
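The $X^2$ statistic above can be sketched in a few lines of Python. The function names and the counts are hypothetical; computing each expected count as n times the hypothesized proportion is the usual convention and is assumed here rather than taken from these notes.

```python
def expected_counts(n, hypothesized_proportions):
    """Expected count per cell, assuming expected = n * hypothesized proportion."""
    return [n * p for p in hypothesized_proportions]


def chi_square_statistic(observed, expected):
    """Sum over all cells of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up example: 100 observations, four categories hypothesized equally likely.
observed = [18, 22, 20, 40]
expected = expected_counts(sum(observed), [0.25, 0.25, 0.25, 0.25])

# The chi-squared approximation is only used when every expected count is >= 5.
large_enough = all(e >= 5 for e in expected)
x2 = chi_square_statistic(observed, expected)
print(large_enough, x2)
```

Turning `x2` into a P-value needs the upper tail area of the chi-squared curve, which the notes leave to the calculator.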
Goodness of Fit Test
H0: $\pi_1$ = hypothesized value 1, $\pi_2$ = hypothesized value 2, …, $\pi_k$ = hypothesized value k
Ha: H0 is not true
Test Statistic: $X^2$ as defined above
P-Value: Probability of getting $X^2$ or larger in a chi-squared distribution with df = k – 1
**I wrote program Chi2 in Calculator to do this

Two Way Frequency Table (Contingency Table) = rectangular table (matrix) that consists of a row for each possible value of x and a column for each possible value of y, where x and y are two random categorical variables and each entry in the matrix is the frequency count (cell count) of that particular (x, y) combination
Marginal Totals = sum of a row or a column
Grand Total = sum of all entries
Expected Cell Count = what would be expected when there is no difference between the groups or experiments under study
Expected cell count = $\frac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$

Comparing Two or More Populations Using X2 Statistic
Assuming the samples are chosen independently and the sample size is large (each expected count is at least 5)
H0: true category proportions are the same for all populations (population homogeneity)
Ha: true category proportions are not the same for all populations
Test Stat: $X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
P-Value = area to the right of $X^2$ under the chi-squared curve with df = (#rows – 1)(#columns – 1)
** This same test can be used to check the independence of 2 categorical variables.
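For a two-way table, the expected cell counts and the $X^2$ statistic can be worked out as in the sketch below. The function names and the 2 by 2 table are invented for illustration; the formulas are the ones stated above.

```python
def expected_cell_counts(table):
    """Expected count = (row marginal total)(column marginal total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_two_way(table):
    """Return (X^2, df) for a two-way table of observed counts."""
    expected = expected_cell_counts(table)
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df


# Hypothetical counts: rows are two populations, columns are two categories.
table = [[20, 30],
         [30, 20]]
expected = expected_cell_counts(table)
x2_table, df_table = chi_square_two_way(table)
print(expected, x2_table, df_table)
```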
In Calculator: Put data into a matrix, and then use X2-Test…

Chapter 13 – Regression Analysis

Deterministic Relationship = one in which the value of y is completely determined by the value of x
Probabilistic Model = a description of the relationship between two variables x and y that are not deterministically related
Additive Probabilistic Model: y = deterministic function of x + random deviation (called e)
Simple Linear Regression = assumes that there is a line with y-intercept $\alpha$ and slope $\beta$; this line is called the population regression line
$y = \alpha + \beta x + e$
Notes:
1. e has a normal distribution
2. e has mean 0 and standard deviation $\sigma$ for any particular x-value
3. the random deviations ($e_1, e_2, \ldots, e_n$) associated with different observations are independent of one another
Estimating the Regression Line
For a collection of points (x, y), we find the regression line $\hat{y} = a + bx$, where a is the point estimate of $\alpha$ and b is the point estimate of $\beta$. a and b are “chosen” so that $\sum (y - \hat{y})^2$ is as small as possible (this is a calculus problem; see formula sheet for more details). This is called the least squares regression line, and it is the line given by the calculator when a regression is done using it.
Coefficient of Determination = the proportion of observed y variation that can be explained by the model relationship (see formula sheet for calculation)
Sample Correlation Coefficient (r) = a measurement of how strongly the x and y values in a sample are linearly related to one another
Population Correlation Coefficient ($\rho$) = a measurement of how strongly the x and y values in the entire population are linearly related to one another
***We use r to make inferences about $\rho$.
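The formula sheet mentioned above is not part of these notes, so the sketch below uses the standard closed-form least squares solution as an assumption: $b = S_{xy}/S_{xx}$, $a = \bar{y} - b\bar{x}$, and $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$. The function name and the data points are made up.

```python
import math


def least_squares(xs, ys):
    """Return (a, b, r) for the line y-hat = a + b*x that minimizes sum (y - y-hat)^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b = s_xy / s_xx                      # point estimate of the slope beta
    a = y_bar - b * x_bar                # point estimate of the intercept alpha
    r = s_xy / math.sqrt(s_xx * s_yy)    # sample correlation coefficient
    return a, b, r


# Hypothetical data set.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = least_squares(xs, ys)

# In simple linear regression the coefficient of determination equals r squared.
r_squared = r ** 2
print(round(a, 3), round(b, 3), round(r_squared, 3))
```

For this data the fitted line is $\hat{y} = 2.2 + 0.6x$, and 60% of the observed y variation is explained by the linear relationship.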
Bivariate Normal Distribution
- for any fixed value of x, the distribution of the associated y-values is normal
- for any fixed value of y, the distribution of the x-values is normal

Test for Independence of Two Numerical Variables in a Bivariate Normal Population
Assuming r is the correlation coefficient for a random sample from a bivariate normal population:
H0: $\rho = 0$ (variables are independent)
Ha: One of $\rho > 0$, $\rho < 0$, $\rho \ne 0$ (variables are not independent)
Test Statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
P-Value: Area (to the right, left, or both ends) under the t curve with df = n – 2
In Calculator: use LinRegTTest…

Chapter 14 – Multivariable Regression

General Additive Multiple Regression Model
Relates a dependent variable y to k predictor variables $x_1, x_2, \ldots, x_k$ by the model equation
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + e$
where the random deviation e is assumed to be normally distributed with mean 0 and variance $\sigma^2$ for any particular values of the predictor variables (which implies that for fixed values of the predictor variables, y has a normal distribution with variance $\sigma^2$)
Population Regression Coefficients
The $\beta$’s in the above regression model
Each $\beta_i$ represents how the mean y-value would change if the corresponding $x_i$ is increased by 1 unit and all other predictor variables are held constant
Population Regression Function
$\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$ = the mean y value for fixed values of the predictor variables

Chapter 15

ANOVA = analysis of variance = checking whether the means of more than 2 populations are identical
Single Factor Analysis of Variance = comparison of k population or treatment means $\mu_1, \mu_2, \ldots, \mu_k$
ANOVA Notation
$N = n_1 + n_2 + \ldots + n_k$ = total number of observations in the data set
$T = n_1 \bar{x}_1 + n_2 \bar{x}_2 + \ldots + n_k \bar{x}_k$ = grand total = sum of all observations
$\bar{\bar{x}} = \frac{T}{N}$ = grand mean
Treatment Sum of Squares (SSTr)
A measurement of the amount of variation from group to group
$SSTr = n_1(\bar{x}_1 - \bar{\bar{x}})^2 + n_2(\bar{x}_2 - \bar{\bar{x}})^2 + \ldots + n_k(\bar{x}_k - \bar{\bar{x}})^2$ (This has df = k – 1)
Error Sum of Squares (SSE)
A measurement of the amount of variation within each group
$SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \ldots + (n_k - 1)s_k^2 = \sum_{i=1}^{k} \sum_{x \text{ in group } i} (x - \bar{x}_i)^2$ (This has df = N – k)
Mean Squares = a sum of squares ÷ its df
$MSTr = \frac{SSTr}{k - 1}$ = mean square for treatments
$MSE = \frac{SSE}{N - k}$ = mean square for error

Single-Factor ANOVA Test
Essentially checking whether the variation within the groups is the same as the variation from group to group; if they are “the same” then it is likely that H0 is true
Assuming
1. each of the k populations is normal
2. $\sigma_1 = \sigma_2 = \ldots = \sigma_k$ (good enough if the largest sample standard deviation is ≤ 2(smallest sample standard deviation))
3. observations in a given sample are independent of one another
4. data is collected in a random manner
H0: $\mu_1 = \mu_2 = \ldots = \mu_k$
Ha: at least 2 of the $\mu$’s are different

ANOVA
Source    df       SS      MS      F
Factor    k – 1    SSTr    MSTr    MSTr/MSE
Error     N – k    SSE     MSE
Total     N – 1    SSTo

P-Value = area of the upper tail of the F curve with df1 = k – 1 and df2 = N – k
Total Sum of Squares (SSTo)
$SSTo = \sum_{\text{all } N \text{ observations}} (x - \bar{\bar{x}})^2$
Fundamental Identity for a Single-Factor ANOVA
SSTo = SSTr + SSE
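The ANOVA quantities above can be checked numerically with a short Python sketch. The function name, the dictionary keys, and the three small groups are made up for illustration; `statistics.variance` is the sample variance $s^2$, which is what the SSE formula uses.

```python
from statistics import mean, variance


def one_way_anova(groups):
    """Compute SSTr, SSE, SSTo, the mean squares, and F for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)                 # N
    grand_mean = sum(sum(g) for g in groups) / n_total    # T / N
    sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    sse = sum((len(g) - 1) * variance(g) for g in groups)
    ssto = sum((x - grand_mean) ** 2 for g in groups for x in g)
    mstr = sstr / (k - 1)
    mse = sse / (n_total - k)
    return {
        "SSTr": sstr, "SSE": sse, "SSTo": ssto,
        "MSTr": mstr, "MSE": mse, "F": mstr / mse,
        "df1": k - 1, "df2": n_total - k,
    }


# Hypothetical data: three treatments with three observations each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

Here SSTr = 26 and SSE = 6, which add up to SSTo = 32 as the fundamental identity requires, and F = 13 with df1 = 2 and df2 = 6. The P-value would then be the upper tail area of that F curve, which this sketch does not compute.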
