An Introduction to Statistical Inference
Glossary

2×2 table: A two-way table where the explanatory and response variables each have two categories.
2SD Method: Approximating a confidence interval by taking the statistic and the standard deviation of the statistic (from simulation or formula) and extending two standard deviations in each direction from the statistic.
3S Strategy: A framework for evaluating the strength of evidence against the chance model (null hypothesis). The three S's are Statistic, Simulate, and Strength of Evidence.
90% confidence interval for π: An interval of values of π that would not be rejected by a two-sided test of significance with a .10 level of significance.
alternative hypothesis: The "not by chance" or "there is an effect" explanation; often the research conjecture.
anecdotal evidence: Evidence for a conclusion that consists of only one or a few observations, which may not be a good representation of the larger situation.
ANOVA test: Analysis of variance test; an overall test of multiple means that compares the variation between groups to the variation within groups.
association: Two variables are associated (related) if the distribution of the response variable differs across the values of the explanatory variable.
biased: A sampling method is biased if the results from different samples consistently overestimate or consistently underestimate the population parameter of interest.
binary variable: A categorical variable with only two outcomes.
categorical variable (alt. qualitative): A variable that places each observational unit into a category; arithmetic operations (e.g., adding, subtracting) on its values don't make sense.
causation: In randomized experiments, we can potentially conclude that the explanatory variable is causing the effect seen in the response variable.
cell contributions: The contribution of each cell in a two-way table to the chi-square statistic. Helpful in determining where the observed data differ most from what would be expected if the null hypothesis were true.
cells: Entries in two-way tables.
census: A study in which data are gathered on all individuals in the population.
center: A typical value in the data, usually calculated as the average or the median.
chi-square distribution: A non-negative, right-skewed distribution used in the theory-based test for an association between two categorical variables.
chi-square statistic: Theory-based test statistic used to evaluate the strength of evidence for an association between two categorical variables with multiple categories.
coefficient of determination: The percentage of the variability in the response variable that is explained by the least-squares regression on the explanatory variable. Equal to the square of the correlation coefficient. Denoted by r² or R².
conditional proportion: Proportion of response variable outcomes for a given category of the explanatory variable.
confidence intervals: An inference tool used to estimate the value of the parameter, with an associated measure of uncertainty due to the randomness in the sample data.
confidence level: How confident we are that the population parameter is contained in our confidence interval. Represents the reliability of the procedure.
confounding variable: A variable that is related to both the explanatory and response variables in such a way that its effects on the response variable cannot be separated from those of the explanatory variable.
convenience samples: Non-random samples from a population.
correlation coefficient: Statistic that measures the direction and strength of a linear relationship between two quantitative variables.
critical value: Multiplier of the standard deviation of the statistic (e.g., z*); the product of the two is the margin of error.
cumulative proportions: Proportion of occurrences of an event after a set number of trials.
data: Values measured or categories recorded on individual entities of interest.
data file (alt. spreadsheet; data table): A way to organize and store the data (the measurements of each observational unit on each variable).
descriptive statistics: Numbers like averages and percentages, or graphs like bar charts.
direction: The direction of association between two quantitative variables can be either positive (as one increases, so does the other) or negative (as one increases, the other decreases).
distribution: The characteristics of a variable's behavior.
dotplots: A basic way to graphically summarize quantitative data.
estimate: A statistic that is our best guess for the size of the tendency (or difference) in the general population or underlying process.
estimation: Using the sample statistic to create a confidence interval to estimate the parameter of interest.
expected counts: Number of observational units you would expect to observe in each cell of the two-way table if the null hypothesis of no association were true.
experimental units: What observational units are called in an experimental study.
explanatory variable: The variable that, if the alternative hypothesis is true, explains changes in the response variable; sometimes known as the independent or predictor variable.
extrapolation: Predicting values of the response variable for values of the explanatory variable that are outside the range of the original data.
F-distribution: Theory-based approximation for the simulated null distribution of the F-statistic; it is non-negative and skewed right.
follow-up analysis: A second step in the analysis process that follows a significant ANOVA or chi-square test. A follow-up test tells where significant differences between pairs of groups are found. This is usually presented as confidence intervals for the difference in each pair of means or each pair of proportions.
form: The form of association between two quantitative variables can be linear or can follow a more complicated curve.
F-statistic: Ratio of the variation between the groups to the variation within the groups.
generalization (alt. generalize): To extend conclusions from a sample to a larger population or process; this is only valid when the sample is representative of the population.
graphical summary: A summary of a distribution that is a graph.
H0: Denotes the null hypothesis.
Ha: Denotes the alternative hypothesis.
independent samples: The data recorded on one sample are unrelated to those recorded on the other sample. In other words, if the data from the samples can be rearranged without altering the structure of the data, then the samples are independent.
influential observations: An observation is considered influential if removing it from the data set dramatically changes the correlation coefficient or regression line. Influential observations often have extreme x values.
interval of plausible values: An interval of values that have been tested under the null and have resulted in p-values higher than the significance level; equivalently, an interval of values that have been tested under the null and do not put the observed data in the tail of the simulated null distribution. These values are concluded to be plausible values for the population parameter.
logic of inference: Involves two components: significance and estimation.
MAD: A statistic for testing an association between variables when there are more than two groups: the (M)ean of the (A)bsolute values of the (D)ifferences in the sample averages or conditional proportions.
margin-of-error: How much we expect our statistic to vary from the parameter due to the random sampling process alone (roughly two standard deviations); the half-width of the confidence interval.
matched-pairs design: Observations are paired and dependent, such as repeat observations on the same individual or observations that are paired naturally (e.g., identical twins).
mean: A measure of center in a distribution, also called the average.
mean squares error: Denominator of the F-statistic. Measures the within-group variation. It is similar to averaging the standard deviations across the groups being compared.
mean squares for treatment: Numerator of the F-statistic. Measures the variation between the groups.
median: The 50th percentile of a distribution, or the middle number in a sorted list.
model: A mathematical or probabilistic conceptualization meant to closely match reality, but always making assumptions about the reality which may or may not be true.
n: A symbol used to indicate the sample size.
no association: General statement of the null hypothesis when two or more variables are involved.
non-sampling error: Reasons why the statistic may not be close to the parameter that are separate from how the sample was selected from the population.
null distribution: Distribution of simulated statistics that represent what could have happened in the study assuming the null hypothesis was true.
null hypothesis: The "by chance alone" or "no effect" explanation; a hypothesis that can be modeled by simulation.
numerical summary: A summary of a distribution that is a number.
observational studies: Studies in which researchers observe individuals and measure variables of interest, but do not intervene in order to attempt to influence responses.
observational units: The individual entities on which data are recorded.
one proportion z-interval: Theory-based confidence interval for π.
outcomes: All possible values a variable can assume.
outlier: An unusual observation. A value of a variable that differs substantially from the general pattern of the other observations in the data set.
outliers: Observations with a large residual, not necessarily influential.
paired design: Study design that allows for the comparison of two groups on a response variable by comparing two measurements on each observational unit instead of on completely separate groups of individuals. This serves to reduce variability in the response variable.
paired: Data collected on paired samples consist of two sets of observations on the response variable that are recorded on the same set of observational units.
parameter: A number calculated from the underlying process or population from which the sample was selected.
p-hat: The proportion or percentage of observational units that have a particular characteristic based on a measured variable. A statistic.
plausible value: A parameter value tested under the null hypothesis that, based on the data gathered, we do not find strong evidence against.
plausible: A term used to indicate that the chance model is a reasonable/believable explanation for the data we observed.
population: The entire set of observational units we want to know about.
practically significant: Large enough, based on the context, to be meaningful.
predictor: Another word for explanatory variable, often used in correlation/regression settings.
probability: Long-run proportion (relative frequency) of times an event would occur if the random process were repeated over and over again under identical conditions.
process: A situation which we think of as a random selection from an underlying set of possible outcomes.
p-value: The proportion of statistics in the null distribution that are at least as extreme as the value of the statistic actually observed in the study.
quantitative variable: Measures on an observational unit for which arithmetic operations (e.g., adding, subtracting) make sense.
quasi-experiments: Experiments that manipulate the explanatory variable, but not randomly.
r: Symbol for the correlation coefficient. Values range from -1 to 1 and are unit-less. Values close to -1 or 1 denote a strong linear relationship, while values close to 0 denote a weak or no linear relationship.
random digit dialing: A common sampling technique when a sampling frame is unavailable. It involves a computer randomly dialing phone numbers within a certain area code by randomly selecting the digits to be dialed after the area code.
random events: Events with an unknown short-term outcome, but a known long-term relative frequency.
random sampling: Using a probability device to select observational units from a population or process.
randomized, comparative experiment: An experiment where experimental units are randomly assigned to two or more treatment conditions and the explanatory variable is actively imposed on the subjects.
range: Distance from the largest to the smallest value in a data set.
regression equation: Least squares regression equation, ŷ = a + bx, where a is the y-intercept, b is the slope, x represents the explanatory variable, and ŷ (pronounced y-hat) is the predicted value for the response variable.
relative frequency: Long-run proportion.
representative: Describes a sample with statistics similar to the parameters in the entire population. Simple random samples are representative; convenience samples may not be representative.
residuals: Same as prediction errors. The vertical distances between the points and the least squares regression line.
response rate: Of those selected to be in the sample, the percent that respond.
response variable: The variable that, if the alternative hypothesis is true, is impacted by the explanatory variable; sometimes known as the dependent variable.
sample size: Number of observational units.
sample: The subgroup of the population on which we record data.
sampling frame: A list of all of the members of the population of interest.
sampling variability: The amount that a value changes as it is observed repeatedly.
scatterplot: Graphical display of the relationship between two quantitative variables.
scope of inference: Involves two components that depend on how the data were gathered: generalizability and cause-and-effect.
segmented bar graphs: Graphical display of conditional proportions from a two-way table.
shape: The form of a graph of quantitative data.
significance level: A value used as a criterion for deciding how small a p-value needs to be to provide convincing evidence against the null hypothesis.
significance: The sample results are unlikely to have arisen by chance alone.
simple random sample: Selecting individuals from the sampling frame so that each individual has the same probability of being selected into the sample.
simulate: Artificially represent a real-life situation by generating observations from a model.
simulated data: Data that are generated by a chance model.
simulation: Artificial representation of a random process used to study the process's long-term properties.
skewed distribution: A distribution with the bulk of the data on one side and a tail on the other; the direction of the skew is the side the tail is on.
slope: Change in the predicted response variable divided by the change in the explanatory variable.
SSE: Sum of squared errors. It is the sum of all the squared residuals.
standard deviation (SD): A typical deviation or distance of the data values from their average.
standard deviation of p-hat: The standard deviation of the distribution of sample proportions can be shown mathematically to follow the formula $\sqrt{\pi(1 - \pi)/n}$.
standard error: Approximate estimate for the standard deviation of the null distribution.
standardize: To standardize an observation, compute the distance of the observation from the mean and divide by the standard deviation of the distribution.
statistic ± margin-of-error: Most confidence intervals can be written in this form.
statistic: A number calculated from the observed data which summarizes information about the variable or variables of interest.
statistical inference: Drawing conclusions beyond the sample data to a larger population or process.
statistical significance: When the sample results are unlikely to have arisen by chance alone.
statistical spreadsheet: The file of individual data values with variables as columns and observational units as rows.
statistical tendency: Not a hard-and-fast rule, rather something that is typically observable.
statistically significant: Sample results that are unlikely to have arisen by chance alone are considered statistically significant.
Statistics: A discipline that guides us in collecting, exploring, and drawing conclusions from data.
strength of evidence: Determining whether the observed statistic provides convincing evidence against the null hypothesis.
strength of association: Strength of association between two quantitative variables tells how closely data follow a particular pattern, be it linear or a more complicated curve.
subjects: Study participants that are human.
symmetric distribution: A distribution with a vertical line of symmetry through it.
test of significance: A procedure for measuring the strength of evidence against a null hypothesis about the parameter of interest.
theory-based approach for a single proportion: A test of significance that uses the Central Limit Theorem to predict the simulated null distribution of sample proportions. If the sample size is large enough, the predicted distribution will be bell-shaped (normal), centered at the underlying probability (π), with a standard deviation of $\sqrt{\pi(1 - \pi)/n}$.
transform: Express data on a different scale, such as logarithmic; often used to meet validity conditions.
t-standardized statistic: Similar to the standardized z-statistic for proportions, this is a standardized statistic for means. Typically the distribution is bell-shaped and symmetric, but it is not exactly normally distributed, being less peaked with fatter tails. As the sample size increases, the distribution becomes more normal.
two by two (2 × 2): A two-way table where the explanatory and response variables each have two categories.
two-sided test: Estimates the p-value by considering results that are at least as extreme as our observed result in either direction.
two-way table: A tabular summary of two categorical variables, also called a contingency table.
unbiased: A sampling method that, on average across many random samples, produces statistics whose average is the value of the population parameter.
unusual observations: Any observations that do not fit in with the pattern of the rest of the observations.
validity check: Check to see that certain conditions are met that render the theory-based approach valid. Often these conditions deal with the sample size, shape, and variability of distributions.
validity conditions for ANOVA: For the F-statistic to follow an F distribution, each population distribution needs to be normal and the population standard deviations need to be equal.
validity conditions for chi-square test: Each cell of the two-way table must have at least 10 observations.
validity conditions for comparing two averages: For the theory-based method for means to apply, we want sample sizes of at least 20 in each group, OR distributions of the quantitative variable in both groups that are normal (bell-shaped and symmetric).
validity conditions for matched pairs: Sample size of pairs is at least 20 OR the differences follow a normal distribution.
validity conditions for one proportion: To use the theory-based approach for a single proportion, there must be at least 10 observations in each category.
variability: Fluctuations in data. In Statistics one needs to consider the sources of variability.
variables: Recorded characteristics of the observational units.
y-intercept: Predicted value of the response variable when the explanatory variable has a value of zero.
z-statistic: Also called the standardized statistic; compares an observed statistic with a hypothesized parameter value. Mostly used with proportions (the t-statistic is used with means).
α: Parameter for the y-intercept of a regression line.
β: Parameter for the slope of a regression line.
ρ: Symbol for the population correlation coefficient.
π: Symbol used for the unknown underlying process probability or true population proportion.
d̄: Observed sample average of the differences.
µd: Population or process parameter for the average of the differences.
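
Several of the entries above (3S Strategy, null distribution, p-value, 2SD Method, standard deviation of p-hat, z-statistic) can be illustrated together for a single proportion. The Python sketch below is illustrative only: the study values (n = 100, null value π = 0.5, observed p-hat = 0.62), the function names, the seed, and the number of repetitions are all assumptions made for this example and do not come from the text. It also assumes the simulated null distribution's SD is an acceptable stand-in for the SD of the statistic in the 2SD Method.

```python
import math
import random


def simulate_null_proportions(n, pi, reps, seed=0):
    """Simulate `reps` sample proportions, each from n trials with success probability pi."""
    rng = random.Random(seed)
    return [sum(rng.random() < pi for _ in range(n)) / n for _ in range(reps)]


def two_sided_p_value(null_stats, observed, pi):
    """Proportion of simulated statistics at least as far from pi as the observed statistic."""
    cutoff = abs(observed - pi)
    # Small tolerance so that ties are counted despite floating-point rounding.
    extreme = sum(abs(s - pi) >= cutoff - 1e-12 for s in null_stats)
    return extreme / len(null_stats)


def sample_sd(values):
    """Standard deviation of a list of numbers (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


# Hypothetical study: 62 successes in 100 trials, null hypothesis pi = 0.5.
n, pi_null, p_hat = 100, 0.5, 0.62

# Statistic -> Simulate -> Strength of evidence.
null_stats = simulate_null_proportions(n, pi_null, reps=5000)
p_value = two_sided_p_value(null_stats, p_hat, pi_null)

# SD of the null distribution: simulated versus the formula sqrt(pi(1 - pi)/n).
sim_sd = sample_sd(null_stats)
theory_sd = math.sqrt(pi_null * (1 - pi_null) / n)

# Standardized statistic (z-statistic) and a 2SD interval around p-hat.
z = (p_hat - pi_null) / theory_sd
interval_2sd = (p_hat - 2 * sim_sd, p_hat + 2 * sim_sd)

print("simulated two-sided p-value:", p_value)
print("simulated SD:", sim_sd, "formula SD:", theory_sd)
print("z-statistic:", z)
print("2SD interval:", interval_2sd)
```

With a large number of repetitions, the simulated SD should land close to the formula value, which is the sense in which the theory-based approach predicts the simulated null distribution.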