Stats 156 – Terms

Chapter 1
Population = the entire collection of individuals or objects about which information is desired
Sample = a subset of the population selected for study in some way
Categorical Data = data that is qualitative in nature
Numerical Data = data that is quantitative in nature
Continuous = the data is made up of some interval on the number line
Discrete = the data is a collection of separated points on the number line

Chapter 2
Bias = tendency for samples to differ from the corresponding population
Selection Bias = some part of the population is systematically excluded from the sample
Measurement Bias = the method of observation produces values that differ from the population
Non-Response Bias = responses are not obtained from all members of the sample
** Selection Bias is the most common
Extraneous Factor = a factor that is not of interest in the current study but is thought to affect the response variable in some way
Confounded Factors = two factors whose effects on the response variable cannot be distinguished from one another

Chapter 3
Relative Frequency = the fraction or proportion of the time that a particular category appears in the data set
Frequency Distribution = a table that displays the possible categories along with the associated frequencies, relative frequencies, cumulative frequencies, and relative cumulative frequencies
Outlier = an unusually small or large data value
Displays: Dot Plot, Stem & Leaf, Histogram

Chapter 4
Mean = average of the observation values
Median = the middle value when the data is arranged in order
Mode = the observation value occurring most often
Standard Deviation = a "typical" deviation from the mean
Lower Quartile (Q1) = the median of the lower half of the data
Upper Quartile (Q3) = the median of the upper half of the data
Interquartile Range (IQR) = upper quartile − lower quartile
Outlier = a sample observation that lies more than 1.5(IQR) away from the upper or lower quartile, or an observation that has |z-score| > 2
Chebyshev's Rule = the percentage of observations that are within k standard deviations of the mean is at least $100\left(1 - \frac{1}{k^2}\right)\%$
So at least 75% are within 2 standard deviations
At least 88.9% are within 3 standard deviations
At least 93.75% are within 4 standard deviations
…
Empirical Rule = if the data is well represented by a normal curve, then 68% of the observations are within 1 standard deviation of the mean, 95% are within 2, and 99.7% are within 3
Z-Score = how many standard deviations an observation is from the mean
5 Number Summary of Data Set = min, Q1, median, Q3, max

Chapter 5
Residual = difference between the observed value and the regression model value
(Pearson's) Correlation Coefficient = measures how closely the data fall on a straight line
Coefficient of Determination = tells what percent of the variation can be explained by the linear relationship between x and y

Chapters 6 & 7
Chance Experiment = any activity or situation in which there is uncertainty about which of two or more possible outcomes will result
Probability of Event E (P(E)) = the ratio of the number of possible ways to get E to the total number of possible outcomes, when all outcomes are equally likely
Properties of Probability
1. For any event E, $0 \le P(E) \le 1$
2. If S is the entire sample space, $P(S) = 1$
3. If events E and F are disjoint, then $P(E \cup F) = P(E) + P(F)$; otherwise $P(E \cup F) = P(E) + P(F) - P(E \cap F)$
4. For any event E, $P(\text{not } E) = 1 - P(E)$
Random Variable = a numerical variable whose value depends on the outcome of a chance experiment
Discrete = possible outcomes are isolated points on a number line
Continuous = possible outcomes are an entire interval on the number line
Probability Distribution (discrete x) = the probability of each possible outcome of the random variable x
Probability Distribution (continuous x) = a function f (sometimes called the density function of the variable x) with the properties that $f(x) \ge 0$ for all x and $\int_{-\infty}^{\infty} f(x)\,dx = 1$
In this case, $P(a \le x \le b) = \int_a^b f(x)\,dx$
Normal Distribution = any probability distribution that has a bell-shaped curve as its density function
Standard Normal Distribution = the normal distribution (bell-shaped curve) with mean 0 and standard deviation 1
*Calculator can find probabilities for normal distributions with mean $\mu$ and standard deviation $\sigma$

Chapter 8
Statistic = any value computed from values in a sample
Sampling Distribution = the probability distribution for a statistic

$\bar{x}$ Sampling
$\bar{x}$ Sampling = for samples of size n in a population, we find $\bar{x}$ for each sample and the variable $\bar{x}$ becomes the random variable of interest
General Properties of an $\bar{x}$ Sampling Distribution
Let $\bar{x}$ denote the mean of the observations in a random sample of size n from a population having mean $\mu$ and standard deviation $\sigma$.
1. $\mu_{\bar{x}} = \mu$
2. $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$, so $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
3. When the population is normal, the sampling distribution of $\bar{x}$ is also normal for any size n
4. For large values of n ($n \ge 30$), the sampling distribution of $\bar{x}$ is approximately normal regardless of whether or not the population itself is normal. This is the Central Limit Theorem.

Proportion Sampling
S = Success = an individual or object has a specific property that is under investigation
F = Failure = does not have the property
$\pi$ = the proportion of successes in the entire population
General Properties of a Proportion Sampling Distribution
Let p be the proportion of successes in a random sample of size n from a population whose proportion of S's is $\pi$. Denote the mean value of p by $\mu_p$ and the standard deviation of p by $\sigma_p$. Then:
1. $\mu_p = \pi$
2. $\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$
3. When n is large and $\pi$ is not too close to 0 or 1, the sampling distribution of p is approximately normal. A conservative rule says that if $n\pi \ge 10$ and $n(1 - \pi) \ge 10$, then it is safe to use a normal approximation.

Chapter 9
Point Estimate = a single number that is based on sample data and represents a plausible value of the characteristic for the entire population
Unbiased Statistic = a statistic whose sampling distribution has a mean value equal to the value of the population characteristic being estimated (otherwise it is called biased)
$\bar{x}$ is an unbiased point estimate of $\mu$
p is an unbiased point estimate of $\pi$
$s^2$ is an unbiased estimate of $\sigma^2$
s is a biased estimate of $\sigma$
When we have several unbiased estimates to choose from, we want the one having minimum variance (called the minimum variance estimate).
Confidence Interval = an interval of plausible values for the characteristic being studied; it is constructed so that, with a chosen degree of confidence, the value of the characteristic will be captured inside the interval
Confidence Level = the success rate of the method used to construct a confidence interval

Large Sample Confidence Interval for $\pi$
1. Let p be the sample proportion and
2. Let n be the sample size
The confidence interval for $\pi$ is $p \pm z^* \sqrt{\frac{p(1 - p)}{n}}$
where z* is the z-value corresponding to the desired confidence level:
80% → 1.282
90% → 1.645
95% → 1.96
98% → 2.326
99% → 2.576

z Confidence Interval for $\mu$
Used when the population standard deviation $\sigma$ is known.
The confidence interval for $\mu$ is $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$, where z* is as above

t Distributions
1. The t curve corresponding to any number of degrees of freedom is bell shaped and centered at 0
2. Each t curve is more spread out than the z curve
3. As the number of degrees of freedom increases, the spread of the corresponding t curves decreases
4. As the number of degrees of freedom increases, the corresponding sequence of t curves becomes closer and closer to the z curve

t Confidence Interval for $\mu$
Used when the population standard deviation is unknown. We instead use the sample standard deviation s and a t distribution rather than the normal distribution.
The confidence interval for $\mu$ is $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$

Chapter 10 – 1 Variable Hypothesis Tests
Null Hypothesis (H0) = a claim about a population characteristic that is initially assumed to be true
H0: population characteristic = hypothesized value
Alternative Hypothesis (Ha) = a statement that the null hypothesis is not true in some way
Ha: population characteristic > hypothesized value
Ha: population characteristic < hypothesized value
Ha: population characteristic ≠ hypothesized value
Type I Error = rejecting H0 when it is in fact true
Type II Error = failing to reject H0 when it is in fact not true
Level of Significance = the probability of a type I error (denoted $\alpha$)
Note: the probability of a type II error is denoted $\beta$
As $\alpha$ gets smaller, $\beta$ automatically gets larger. We will focus on controlling the value of $\alpha$.
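The Chapter 9 interval formulas for a proportion and for a mean with known standard deviation can be sketched in a few lines of Python. This is only an illustration, not part of the course material: the function names and the sample numbers are made up, and z* is computed from the standard normal curve instead of being read from the table. The t interval is left out because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import NormalDist


def z_star(confidence):
    """Critical value z* that leaves the given central area under the z curve."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def proportion_interval(p, n, confidence=0.95):
    """Large-sample interval p +/- z* sqrt(p(1 - p)/n) for a population proportion."""
    margin = z_star(confidence) * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


def mean_interval_known_sigma(x_bar, sigma, n, confidence=0.95):
    """z interval x_bar +/- z* sigma/sqrt(n), assuming sigma is known."""
    margin = z_star(confidence) * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


# Hypothetical sample: 120 successes out of 200 observations.
low, high = proportion_interval(p=120 / 200, n=200, confidence=0.95)
print(f"95% interval for the population proportion: ({low:.3f}, {high:.3f})")
```

Calling `z_star` with 0.80, 0.90, 0.95, 0.98, or 0.99 should reproduce the tabled z* values to three decimal places.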
Test Statistic = the function of sample data on which the decision to reject or fail to reject H0 is based
For population proportions, the test statistic is p = sample proportion
For population means, the test statistic is $\bar{x}$ = sample mean
P-Value = the probability that the observed value (or a more extreme value) will occur, assuming H0 is true
The smaller the P-value, the stronger the evidence against H0 (we reject H0 if P-value ≤ $\alpha$)
The larger the P-value, the weaker the evidence against H0 (we fail to reject H0 if P-value > $\alpha$)
We compare the P-value to the level of the test ($\alpha$) to determine whether to reject H0 or not

P-Value for a Population Proportion Hypothesis Test
If we have that $np \ge 10$ and $n(1 - p) \ge 10$:
Given H0: $\pi = \pi_0$, a sample proportion p, and $\sigma_p = \sqrt{\frac{\pi_0(1 - \pi_0)}{n}}$
1. Ha: $\pi > \pi_0$: P-value = normalcdf(p, 10^10, $\pi_0$, $\sigma_p$)
2. Ha: $\pi < \pi_0$: P-value = normalcdf(−10^10, p, $\pi_0$, $\sigma_p$)
3. Ha: $\pi \ne \pi_0$: P-value = 2*normalcdf(−10^10, p, $\pi_0$, $\sigma_p$) if p < $\pi_0$, or 2*normalcdf(p, 10^10, $\pi_0$, $\sigma_p$) if p > $\pi_0$
Note: This can be done on the TI-83 using 1-PropZTest

5 Pieces of a Hypothesis Test
1. H0
2. Ha
3. Test statistic (p = # or $\bar{x}$ = #)
4. P-Value
5. Conclusion

One-Sample t Test (or z Test) for a Population Mean
If we have that $n \ge 30$:
Given H0: $\mu = \mu_0$, a sample mean $\bar{x}$, and $\sigma_{\bar{x}} = \frac{s}{\sqrt{n}}$, let $t^* = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}}$ and use the t curve with df = n − 1
1. Ha: $\mu > \mu_0$: P-value = tcdf(t*, 10^10, df)
2. Ha: $\mu < \mu_0$: P-value = tcdf(−10^10, t*, df)
3. Ha: $\mu \ne \mu_0$: P-value = 2*tcdf(−10^10, t*, df) if $\bar{x} < \mu_0$, or 2*tcdf(t*, 10^10, df) if $\bar{x} > \mu_0$
** In Calculator = ZTest or TTest

Chapter 11 – 2 Variable Hypothesis Tests
Independent Samples = two samples in which the selection of one sample in no way affects the selection of the other sample
Paired Samples = observations from the first sample are in some meaningful way paired with observations from the second sample

Sampling Distribution for $\bar{x}_1 - \bar{x}_2$
Suppose two independent quantities, A and B, are normal.
Then A − B is normal with mean $\mu_{A-B} = \mu_A - \mu_B$ and standard deviation $\sigma_{A-B} = \sqrt{\sigma_A^2 + \sigma_B^2}$
So the distribution of $\bar{x}_1 - \bar{x}_2$ is normal with mean $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$ and standard deviation $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

Two Sample t Test for 2 Population Means
Assuming $n_1$ and $n_2$ are both ≥ 30 (or both populations are normal) and the samples were selected independently:
H0: $\mu_1 - \mu_2 = 0$
Ha: One of:
  $\mu_1 - \mu_2 > 0$
  $\mu_1 - \mu_2 < 0$
  $\mu_1 - \mu_2 \ne 0$
Test Statistic: t as shown below
P-Value: One of (in the same order as Ha):
  Area to the right of the computed t under the t curve
  Area to the left of the computed t under the t curve
  Sum of the areas to the right of |computed t| and to the left of −|computed t|
Where $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, with $\mu_1 - \mu_2 = 0$ under H0,
and df = $\frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$, rounded down
** In Calculator = 2-SampZTest or 2-SampTTest

Paired t Test for Comparing 2 Population Means
Assuming the samples are paired, the n sample differences can be viewed as a random sample from a population of differences, and n ≥ 30 or both populations are normal:
H0: $\mu_d = \mu_1 - \mu_2$ = hypothesized value
Ha: One of:
  $\mu_d$ > hypothesized value
  $\mu_d$ < hypothesized value
  $\mu_d$ ≠ hypothesized value
Test Statistic: t as shown below
P-Value: One of (in the same order as Ha):
  Area to the right of the calculated t under the t curve with n − 1 df
  Area to the left of the calculated t under the t curve with n − 1 df
  Sum of the areas to the right of |t| and to the left of −|t|
Where $t = \frac{\bar{x}_d - \text{hypothesized value}}{s_d / \sqrt{n}}$ and df = n − 1
** In Calculator = ZTest or TTest (since we are back to one random variable)

Large Sample z Test for 2 Population Proportions
Assuming independent samples, $n_1 p_1 \ge 10$, $n_1(1 - p_1) \ge 10$, $n_2 p_2 \ge 10$, and $n_2(1 - p_2) \ge 10$:
H0: $\pi_1 - \pi_2 = 0$
Ha: One of:
  $\pi_1 - \pi_2 > 0$
  $\pi_1 - \pi_2 < 0$
  $\pi_1 - \pi_2 \ne 0$
Test Statistic: $z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}}$ where $p_c = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
P-Value: Upper, lower, or two tailed area under the z curve (just as in the prior proportion tests)
** In Calculator = 2-PropZTest

Distribution Free = procedures that do not require any overly specific assumptions about the population distributions

Rank Sum Test
Assuming: the samples are randomly collected or the two treatments are randomly assigned to individuals, and the two population distributions have the same shape and spread (but are not necessarily normal)
H0: $\mu_1 - \mu_2 = 0$
Ha: One of:
  $\mu_1 - \mu_2 > 0$ (Upper tail test)
  $\mu_1 - \mu_2 < 0$ (Lower tail test)
  $\mu_1 - \mu_2 \ne 0$ (Two tailed test)
Test Statistic: Rank sum = sum of the ranks assigned to the observations in the first sample
P-Value: Found from the table on page 817 of Peck, Olsen, Devore
**Rank:
1. List all observations (from both samples) from smallest to largest.
2. Rank them: smallest = 1, next smallest = 2, …
Ties: rank each as the average of the positions in the list, i.e. if 50 was both the 4th and 5th observation, each 50 would have rank 4.5

Chapter 12 – $X^2$ Hypothesis Tests
Goodness of Fit Statistic = $X^2$ = a quantitative measure of the extent to which the observed counts differ from those expected when the null hypothesis is true
$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
If all expected counts are ≥ 5, then $X^2$ has approximately a $\chi^2$ (chi-squared) probability distribution.

Goodness of Fit Test
H0: $\pi_1$ = hypothesized value 1, $\pi_2$ = hypothesized value 2, $\pi_3$ = hypothesized value 3, …, $\pi_k$ = hypothesized value k
Ha: H0 is not true
Test Statistic: $X^2$ as defined above
P-Value: Probability of getting $X^2$ or larger in a chi-squared distribution with df = k − 1
**I wrote program Chi2 in Calculator to do this

Two Way Frequency Table (Contingency Table) = rectangular table (matrix) that consists of a row for each possible value of x and a column for each possible value of y, where x and y are two categorical random variables and each entry in the matrix is the frequency count (cell count) of that particular (x, y) combination
Marginal Totals = sum of a row or a column
Grand Total = sum of all entries
Expected Cell Count = what would be expected when there is no difference between the groups or experiments under study
Expected cell count = $\frac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$

Comparing Two or More Populations Using the $X^2$ Statistic
Assuming the samples are chosen independently and the sample size is large (each expected count is at least 5)
H0: true category proportions are the same for all populations (population homogeneity)
Ha: true category proportions are not the same for all populations
Test Statistic: $X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
P-Value = area to the right of $X^2$ under the chi-squared curve with df = (# rows − 1)(# columns − 1)
** This same test can be used to check the independence of 2 categorical variables.
In Calculator: Put the data into a matrix, and then use $\chi^2$-Test…

Chapter 13 – Regression Analysis
Deterministic Relationship = one in which the value of y is completely determined by the value of x
Probabilistic Model = a description of the relationship between two variables x and y that are not deterministically related
Additive Probabilistic Model: y = deterministic function of x + random deviation (called e)
Simple Linear Regression = assumes that there is a line with y-intercept $\alpha$ and slope $\beta$; this line is called the population regression line
$y = \alpha + \beta x + e$
Notes:
1. e has a normal distribution
2. e has mean 0 and standard deviation $\sigma$ for any particular x-value
3. the random deviations ($e_1, e_2, \dots, e_n$) associated with different observations are independent of one another

Estimating the Regression Line
For a collection of points (x, y), we find the regression line $\hat{y} = a + bx$, where a is the point estimate of $\alpha$ and b is the point estimate of $\beta$.
a and b are "chosen" so that $\sum (y - \hat{y})^2$ is as small as possible (this is a calculus problem; see the formula sheet for more details). The result is called the least squares regression line, and it is the line given by the calculator when a regression is done using it.
Coefficient of Determination = the proportion of observed y variation that can be explained by the model relationship (see formula sheet for calculation)
Sample Correlation Coefficient (r) = a measurement of how strongly the x and y values in a sample are linearly related to one another
Population Correlation Coefficient ($\rho$) = a measurement of how strongly the x and y values in the entire population are linearly related to one another
***We use r to make inferences about $\rho$.
Bivariate Normal Distribution = for any fixed value of x, the distribution of the associated y-values is normal, and for any fixed value of y, the distribution of the x-values is normal

Test for Independence of Two Numerical Variables in a Bivariate Normal Population
Assuming r is the correlation coefficient for a random sample from a bivariate normal population:
H0: $\rho = 0$ (variables are independent)
Ha: One of $\rho > 0$, $\rho < 0$, $\rho \ne 0$ (variables are not independent)
Test Statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
P-Value: Area (to the right, left, or both ends) under the t curve with df = n − 2
In Calculator: use LinRegTTest…

Chapter 14 – Multivariable Regression
General Additive Multiple Regression Model
Relates a dependent variable y to k predictor variables $x_1, x_2, \dots, x_k$ by the model equation
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + e$
where the random deviation e is assumed to be normally distributed with mean 0 and variance $\sigma^2$ for any particular values of the predictor variables (which implies that, for fixed values of the predictor variables, y has a normal distribution with variance $\sigma^2$)
Population Regression Coefficients
The $\beta$'s in the above regression model
Each $\beta_i$ represents how the y-value would change if the corresponding $x_i$ is increased by 1 unit and all other predictor variables are held constant
Population Regression Function
$\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$ = the mean y value for fixed values of the predictor variables

Chapter 15
ANOVA = analysis of variance: checking whether the means of more than 2 populations are identical
Single Factor Analysis of Variance = comparison of k population or treatment means $\mu_1, \mu_2, \dots, \mu_k$
ANOVA Notation
N = $n_1 + n_2 + \dots + n_k$ = total number of observations in the data set
T = $n_1 \bar{x}_1 + n_2 \bar{x}_2 + \dots + n_k \bar{x}_k$ = grand total = sum of all observations
$\bar{\bar{x}}$ = grand mean = $\frac{T}{N}$
Treatment Sum of Squares (SSTr)
A measurement of the amount of variation from group to group
$SSTr = n_1(\bar{x}_1 - \bar{\bar{x}})^2 + n_2(\bar{x}_2 - \bar{\bar{x}})^2 + \dots + n_k(\bar{x}_k - \bar{\bar{x}})^2$ (This has df = k − 1)
Error Sum of Squares (SSE)
A measurement of the amount of variation within each group
$SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2 = \sum_{i=1}^{k} \sum_{x \text{ in group } i} (x - \bar{x}_i)^2$ (This has df = N − k)
Mean Squares = a sum of squares divided by its df
$MSTr = \frac{SSTr}{k - 1}$ = mean square for treatments
$MSE = \frac{SSE}{N - k}$ = mean square for error

Single-Factor ANOVA Test
Essentially checking whether the variation within the groups is the same as the variation from group to group; if they are "the same" then it is likely that H0 is true
Assuming
1. each of the k populations is normal
2. $\sigma_1 = \sigma_2 = \dots = \sigma_k$ (good enough if the largest sample standard deviation is ≤ 2(smallest sample standard deviation))
3. observations in a given sample are independent of one another
4. data is collected in a random manner
H0: $\mu_1 = \mu_2 = \dots = \mu_k$
Ha: at least 2 of the $\mu$'s are different

ANOVA Table
Source    df       SS      MS      F
Factor    k − 1    SSTr    MSTr    MSTr/MSE
Error     N − k    SSE     MSE
Total

P-Value = area of the upper tail of the F curve with df1 = k − 1 and df2 = N − k
Total Sum of Squares (SSTo)
$SSTo = \sum_{\text{all observations}} (x - \bar{\bar{x}})^2$
Fundamental Identity for a Single-Factor ANOVA
SSTo = SSTr + SSE
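
The sums of squares and the F ratio can be checked on a small data set. The sketch below is an illustration only: the function name and the three treatment groups are invented, and the P-value step is omitted because the Python standard library has no F distribution.

```python
from statistics import mean, variance


def single_factor_anova(groups):
    """Return SSTr, SSE, SSTo, and F for a list of samples (one list per group)."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    N = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / N  # T / N

    # SSTr: variation from group to group (df = k - 1)
    sstr = sum(n * (mean(g) - grand_mean) ** 2 for n, g in zip(sizes, groups))
    # SSE: variation within each group (df = N - k); variance() is the sample variance s^2
    sse = sum((n - 1) * variance(g) for n, g in zip(sizes, groups))
    # SSTo: variation of every observation about the grand mean
    ssto = sum((x - grand_mean) ** 2 for g in groups for x in g)

    mstr = sstr / (k - 1)
    mse = sse / (N - k)
    return {"SSTr": sstr, "SSE": sse, "SSTo": ssto, "F": mstr / mse}


# Hypothetical data: three treatments with three observations each.
result = single_factor_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(result)
```

For any data set, the returned SSTo should equal SSTr + SSE up to rounding, which is the fundamental identity above.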