An Introduction to Statistical Inference
Glossary
µd
Population or long-run parameter for the average of the differences ....................................... 7-5
2×2 table
A two-way table where the explanatory and response variables each have two categories .... 5-6
2SD Method
Approximating a confidence interval by taking the statistic and the standard deviation of the
statistic (from simulation or formula) and extending two standard deviations in each direction
from the statistic. .......................................................................................................... 2-15, 2-19
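As a minimal sketch of the 2SD Method described above: the function name and the example numbers below are illustrative assumptions, not values from the text. The half-width it computes (two standard deviations) is what the glossary calls the margin-of-error.

```python
def two_sd_interval(statistic, sd_of_statistic):
    """Approximate a confidence interval with the 2SD method.

    `sd_of_statistic` is the standard deviation of the statistic,
    obtained from simulation or from a formula.
    """
    margin_of_error = 2 * sd_of_statistic  # half-width of the interval
    return statistic - margin_of_error, statistic + margin_of_error


# Hypothetical example: a sample proportion of 0.60 whose
# standard deviation (from simulation) is 0.05.
low, high = two_sd_interval(0.60, 0.05)
```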
3S Strategy
A framework for evaluating the strength of evidence against the chance model (null hypothesis).
The 3 S’s are: Statistic, Simulate, and Strength of Evidence .......................................... 1-8, 1-16
alternative hypothesis
The "not by chance" or "there is an effect" explanation; it is our research conjecture. ... 1-18, 1-26
ANOVA test
Analysis of variance test; an overall test of multiple means that compares the variation between
groups to the variation within groups ................................................................................... 9-21
association
Two variables are associated or related if the distribution of the response variable differs across
the values of the explanatory variable ............................................................................ 4-3, 4-5
bar graph
A graphical display of the distribution of a categorical variable .............................................. 1-10
biased
A sampling method is biased if the results from different samples consistently overestimate or
consistently underestimate the population parameter of interest. ............................ 3-1, 3-4, 3-13
binary variable
Categorical variable with only two outcomes ................................................................ 1-17, 1-18
boxplots
A graphical display of the five number summary ............................................................ 9-4, 9-10
causation
Inferring that the explanatory variable is causing the effect seen in the response variable .... 6-27
cause-and-effect
In well-designed studies (randomized experiments), we can conclude that the explanatory variable
is causing the effect seen in the response variable ................................................................ 4-3
cell contributions
Contribution of a cell in a two-way table to the chi-square statistic. Helpful in determining where
the observed data differ most from what would be expected if the null hypothesis were true . 8-22
cells
Entries in two-way tables ........................................................................................................ 5-3
census
When data are gathered on all individuals in the population .................................................... 3-2
chance models
A real or computerized process to generate data according to a well-understood set of
conditions....................................................................................................................... 1-5, 1-16
Chi-square distribution
A non-negative, right-skewed distribution used in theory-based test for an association between
two categorical variables ....................................................................................................... 8-21
Chi-square statistic
Theory-based test statistic used to evaluate the strength of evidence for an association between
two categorical variables with multiple categories ................................................................. 8-20
coefficient of determination
Denoted by r² or R². It is interpreted as the percentage of the variability in the response
variable that is explained by the least-squares regression on the explanatory variable. The
coefficient of determination is equal to the square of the correlation coefficient ...... 10-23, 10-30
conditional proportion
Proportion of response variable for a given category of the explanatory variable ............. 5-4, 5-7
confidence intervals
An inference tool used to estimate the value of the parameter, with an associated measure of
uncertainty due to the randomness in the sample data ........................................................... 2-2
confidence level
A statement of reliability in the confidence interval method ................................... 2-7, 2-11, 2-33
confounding variable
A confounding variable is a variable that is related to both the explanatory and response
variable in such a way that its effects on the response variable cannot be separated from the
explanatory variable. ........................................................................................................ 4-4, 4-6
convenience samples
A non-random sample of a population..................................................................................... 3-4
correlation coefficient
Statistic that measures direction and strength of a linear relationship between two quantitative
variables ...................................................................................................................... 10-4, 10-9
data table
Format for storing data values ................................................................................................. 3-2
distribution
The characteristics of a variable’s behavior ............................................................................. 6-2
double-blind study
Both the researchers and subjects are blind to which treatment each subject receives ......... 6-19
estimation
Using the sample statistic to create a confidence interval to estimate the parameter of interest ... 6-26
expected counts
Number of observational units you would expect to observe in each cell of the two-way table if
the null hypothesis of no association were true ..................................................................... 8-30
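The expected counts, cell contributions, and chi-square statistic defined in this glossary can be sketched as follows. This assumes the usual formula (row total times column total, divided by the grand total) for expected counts; the function names and the small table are made up for illustration.

```python
def expected_counts(table):
    """Expected cell counts under the null hypothesis of no association."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def cell_contributions(table):
    """Contribution of each cell: (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return [
        [(obs - exp) ** 2 / exp for obs, exp in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


def chi_square_statistic(table):
    """Sum of the cell contributions over the whole two-way table."""
    return sum(sum(row) for row in cell_contributions(table))


# Hypothetical two-way table of observed counts.
observed = [[10, 20], [30, 40]]
expected = expected_counts(observed)
chi_square = chi_square_statistic(observed)
```

Large cell contributions point to the cells where the observed counts depart most from what the null hypothesis predicts.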
experiment
A study in which researchers actively assign subjects to treatment groups ........................... 4-10
experimental units
What observational units are called in an experiment ................................................ 4-10, 4-15
explanatory variable
The variable that, if the alternative hypothesis is true, is explaining changes in the response
variable; sometimes known as the independent or predictor variable ............................... 4-3, 4-6
extrapolation
Predicting values for the response variable for given values of the explanatory variable that are
outside of the range of the original data ................................................................... 10-24, 10-29
F-distribution
Theory-based approximation for the simulated null distribution of the F-statistic; it is non-negative
and skewed right ........................................................................................................ 9-20, 9-26
five-number summary
minimum, lower quartile, median, upper quartile, maximum .................................................... 6-3
follow-up analysis
A second step in the analysis process that follows a significant ANOVA test. A follow-up test
tells where significant differences between pairs of groups are found. This is usually presented
as confidence intervals for the difference in each pair of means .................................. 9-23, 9-27
F-statistic
Ratio of variation between the groups to the variation within the groups ............. 9-19, 9-25, 9-29
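A rough sketch of how the F-statistic is assembled from the mean squares for treatment (numerator) and the mean squares error (denominator). The degrees of freedom used here (number of groups minus one, and total sample size minus number of groups) are the standard ones and are an assumption, since the glossary does not state them; the data are invented.

```python
from statistics import mean


def f_statistic(groups):
    """Ratio of between-group variation to within-group variation."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)      # number of groups
    n = len(all_values)  # total sample size

    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares around each group's own mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    ms_treatment = ss_between / (k - 1)  # numerator of F
    ms_error = ss_within / (n - k)       # denominator of F
    return ms_treatment / ms_error


# Hypothetical data: three groups of three observations each.
f_value = f_statistic([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```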
generalize, generalization
Extension of conclusions from a sample to a population; this is only valid when the sample is
representative of the population. ............................................................................. 3-1, 3-7, 6-26
H0
Denotes null hypothesis ........................................................................................................ 1-33
Ha
Denotes alternative hypothesis ............................................................................................. 1-33
histogram
A graph used with quantitative variables ........................................................................ 3-3, 3-35
independent groups design
Each individual in a group is unrelated to all the other individuals in the study. Each individual
provides one response value. ...................................................................................... 4-22, 4-25
independent samples
The data recorded on one sample are unrelated to those recorded on the other sample. In other
words, if the data from the samples can be rearranged without affecting the outcome then the
samples are independent. .............................................................................................. 7-4, 7-10
influential observations
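An observation is considered influential if removing it from the data set dramatically changes
the correlation coefficient or regression line. Influential observations often have extreme x
values. ............................................................................................ 10-5, 10-25, 10-28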
inter-quartile range (IQR)
The difference between the upper quartile and the lower quartile ........................................... 6-3
lower quartile
The value for which 25% of the data lie below ........................................................................ 6-3
MAD
A statistic for testing an association between two categorical variables with more than two
categories. (M)ean of the (A)bsolute values of the (D)ifferences in the conditional proportions ........ 8-6, 8-12
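One way to compute the MAD statistic, assuming the differences are taken over every pair of groups; the function name and the proportions are illustrative only.

```python
from itertools import combinations


def mad_statistic(conditional_proportions):
    """Mean of the absolute differences between all pairs of
    conditional proportions (one proportion per group)."""
    pairs = list(combinations(conditional_proportions, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical conditional proportions of "success" in three groups.
mad = mad_statistic([0.2, 0.5, 0.8])
```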
margin-of-error
The half-width of a confidence interval ......................................................................... 2-15, 2-19
matched-pairs design
Each subject receives both treatments, with the order in which the treatments are received
randomized for each subject ................................................................................................. 7-3
mean squares error
Denominator of the F-statistic. Measures the within group variation. It is similar to averaging the
standard deviations across the groups being compared........................................................ 9-21
mean squares for treatment
Numerator of the F-statistic. Measures the variation between the groups. ............................ 9-21
model
A mathematical or probabilistic conceptualization meant to closely match reality, but always
making assumptions about the reality which may or may not be true ...................................... 1-5
n
A symbol used to indicate the sample size ............................................................................ 1-19
no association
General statement of the null hypothesis when two or more variables are involved. ............. 5-11
non-sampling errors
Reasons why the statistic may not be close to the parameter that are separate from how the
sample was selected from the population.............................................................................. 3-44
null distribution
Distribution of simulated statistics that represent what could have happened in the study
assuming the null hypothesis was true ......................................................................... 1-20, 1-27
null hypothesis
The "by chance alone" or "no effect" explanation; a hypothesis that can be modeled by simulation.
.................................................................................................................................... 1-18, 1-26
observational study
Studies in which researchers observe individuals and measure variables of interest, but do not
intervene in order to attempt to influence responses. ................................................... 4-10, 4-15
one-proportion z-test
Name for the theory-based approach to a test about one proportion ........................................ 1-62
outliers
An observation with a large residual, not necessarily influential ............................................ 10-5
paired data
Data collected on paired samples consist of two sets of observations on the response variable
that are recorded on the same set of observational units ........................................................ 7-4
paired design
Study design that allows for the comparison of two groups on a response variable but by
comparing two measurements on each observational unit instead of on completely separate
groups of individuals. This serves to reduce variability in the response variable........... 4-22, 4-25
parameter
A number calculated from the underlying process or population from which the sample was
selected ......................................................................................................................... 3-2, 3-12
p-hat
The proportion or percentage of observational units in the sample that have a particular
characteristic based on a measured variable. A statistic. ...................................................... 1-19
plausible value
A parameter value tested under the null hypothesis where, based on the data gathered, we do
not find strong evidence against the null ................................................................................. 2-5
plausible
A term used to indicate that the chance model is a reasonable/believable explanation for the
data we observed.................................................................................................................... 1-9
population
The entire set of observational units we want to know about ................................................... 3-1
predictor
Another word for explanatory variable, often used in correlation/regression settings ............. 10-2
process
A situation which we think of as a random selection from an underlying set of possible outcomes
............................................................................................................................................. 3-18
p-value
The proportion of statistics in the null distribution that are at least as extreme as the value of the
statistic actually observed in the study. ........................................................................ 1-21, 1-28
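The definition above can be sketched as a small simulation for a single proportion. This is only an illustration: it assumes a one-sided ("greater than or equal to") notion of "at least as extreme", and the function name, parameters, and example numbers are hypothetical.

```python
import random


def simulate_p_value(observed_successes, n, null_probability,
                     reps=10_000, seed=0):
    """Proportion of simulated statistics that are at least as
    extreme (here: at least as large) as the observed statistic."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    at_least_as_extreme = 0
    for _ in range(reps):
        # One simulated study under the chance model (null hypothesis).
        simulated = sum(rng.random() < null_probability for _ in range(n))
        if simulated >= observed_successes:
            at_least_as_extreme += 1
    return at_least_as_extreme / reps


# Hypothetical study: 14 successes in 16 trials, null probability 0.5.
p_value = simulate_p_value(observed_successes=14, n=16, null_probability=0.5)
```

The collection of `simulated` values is the null distribution; a small p-value indicates strong evidence against the null hypothesis.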
quantitative variable
Measures on an observational unit for which arithmetic operations (e.g., adding, subtracting)
make sense ............................................................................................................................ 3-2
quasi-experiments
Experiments that manipulate the explanatory variable, but not randomly .............................. 4-14
r
Symbol for the correlation coefficient. Values range from -1 to 1 and are unitless. Values close to -1
and 1 denote a strong linear relationship while values close to 0 denote a weak or no linear
relationship ........................................................................................................................... 10-4
random digit dialing
A common sampling technique when a sampling frame is unavailable. It involves a computer
randomly dialing phone numbers within a certain area code by randomly selecting the digits to
be dialed after the area code................................................................................................. 3-42
random sampling
Using a probability device to select observational units from a population or process .... 3-7, 3-47
randomized experiment
An experiment where experimental units are randomly assigned to two or more treatment
conditions and the explanatory variable is actively imposed on the subjects. ................ 4-10, 4-15
relative risk
The ratio of conditional proportions .................................................................................. 5-5, 5-8
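A short sketch of the relative risk as a ratio of two conditional proportions. The helper names and counts are assumptions made for illustration.

```python
def conditional_proportion(successes, group_size):
    """Proportion of 'successes' within one category of the
    explanatory variable."""
    return successes / group_size


def relative_risk(successes_a, size_a, successes_b, size_b):
    """Ratio of the conditional proportion in group A to that in group B."""
    return (conditional_proportion(successes_a, size_a)
            / conditional_proportion(successes_b, size_b))


# Hypothetical counts: 30 of 100 in group A versus 10 of 100 in group B.
rr = relative_risk(30, 100, 10, 100)
```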
representative
Describes a sample with statistics similar to the parameters in the entire population. Simple
random samples are representative; convenience samples may not be representative ... 3-1, 3-4
residuals
The vertical distances between the observed points and the least squares regression line ... 10-24, 10-26
resistant
A statistic is resistant if its value does not change considerably when extreme observations are
removed from a data set .............................................................................................. 3-25, 3-34
response rate
Of those selected to be in the sample, the percentage that respond. .................................... 3-43
response variable
The variable that, if the alternative hypothesis is true, is impacted by the explanatory variable;
sometimes known as the dependent variable. ......................................................................... 4-3
sample
The subgroup of the population on which we record data ......................................... 1-4, 3-1, 3-2
sample size
The number of observational units in the sample .................................................................... 1-4
sampling frame
A list of all of the members of the population of interest ................................................. 3-4, 3-14
sampling variability
The amount that a statistic changes from sample to sample as samples are repeatedly drawn ..... 3-6
segmented bar graphs
Graphical display of conditional proportion from two-way table ............................................... 5-4
significance
Whether the sample results are unlikely to have arisen by chance alone .................................. 6-26
significance level
A value used as a criterion for deciding how small a p-value needs to be to provide convincing
evidence against the null hypothesis .............................................................................. 2-6, 2-10
simple random sample
A sample chosen at random from the sampling frame so that each individual has the same
probability of being selected into the sample ............................................................ 3-4, 3-7, 3-14
skewed distribution
The bulk of observations tend to fall on one side of the distribution ....................................... 3-24
slope
Change in predicted response variable divided by change in explanatory variable .. 10-22, 10-28
SSE
Sum of squared errors. It is the sum of all the squared residuals. ...................................... 10-27
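The regression entries in this glossary (slope, residuals, SSE, coefficient of determination) fit together as in the sketch below. It assumes the standard least-squares formulas, which the glossary itself does not spell out; the function names and the four data points are invented.

```python
from statistics import mean


def least_squares(x, y):
    """Slope and intercept of the least squares regression line."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Observed response minus predicted response for each point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

slope, intercept = least_squares(x, y)
res = residuals(x, y, slope, intercept)
sse = sum(r ** 2 for r in res)                # sum of squared residuals
sst = sum((yi - mean(y)) ** 2 for yi in y)    # total variability in y
r_squared = 1 - sse / sst                     # coefficient of determination
```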
standard deviation of p-hat
The standard deviation of the distribution of sample proportions can be shown mathematically to
follow the formula: $\sqrt{\pi(1 - \pi)/n}$. ..................................................................... 1-55
standard error
Estimate for the standard deviation of the null distribution of the statistic ....... 2-21, 2-26, 5-37
standardize
To standardize an observation, compute the distance of the observation from the mean and
divide by the standard deviation of the distribution. ...................................................... 1-35, 1-39
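As an illustration of the two entries above, the sketch below computes the standard deviation of p-hat from the formula sqrt(pi(1 - pi)/n) and then standardizes an observed sample proportion. The function names and numbers are assumptions chosen for the example.

```python
import math


def sd_of_p_hat(pi, n):
    """Standard deviation of the distribution of sample proportions,
    where pi is the process probability and n is the sample size."""
    return math.sqrt(pi * (1 - pi) / n)


def standardize(observation, mean, sd):
    """Distance of the observation from the mean, in standard deviations."""
    return (observation - mean) / sd


# Hypothetical example: null value pi = 0.5, n = 100, observed p-hat = 0.6.
sd = sd_of_p_hat(0.5, 100)
z = standardize(0.6, 0.5, sd)
```

The standardized sample proportion computed this way is what the glossary calls the z-statistic.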
statistic
A number calculated from the observed data which summarizes information about the variable
or variables of interest ...................................................................................................... 1-4, 3-2
statistical significance
Results unlikely to have arisen by chance alone ..................................................................... 1-2
statistically significant
Unlikely to occur just by random chance ................................................................................. 1-3
strength of evidence
How much evidence we have against the null hypothesis ....................................................... 1-2
subjects
Study participants that are human ......................................................................................... 1-25
test of significance
A procedure for measuring the strength of evidence against a null hypothesis about the
parameter of interest ............................................................................................................. 1-17
transform
Express data on a different scale, such as logarithmic, often used to meet validity conditions ... 10-46
t-standardized statistic
A standardized statistic (used with means) that follows a theoretical t distribution when the null
hypothesis is true. ................................................................................................................. 6-35
two by two (2 × 2)
A two-way table where the explanatory and response variables each have two categories .... 5-4
two-sided test
Estimates the p-value by considering results that are at least as extreme as our observed result
in either direction.......................................................................................................... 1-46, 1-52
two-way table
A tabular summary of two categorical variables, also called a contingency table .................... 5-3
Type I Error
Rejecting the null hypothesis when it is actually true (false alarm) ............................... 2-36, 2-39
Type II Error
Failing to reject a null hypothesis that is actually false (missed opportunity) ................. 2-36, 2-39
unbiased
A sampling method that, on average across many random samples, produces statistics whose
average is the value of the population parameter ............................................................. 3-6, 3-7
upper quartile
The value for which 25% of the data lie above ........................................................................ 6-3
validity condition
Check to see that certain conditions are met that render the theory-based approach valid. Often
these conditions deal with sample size and shape and variability of distributions. ................. 1-55
validity condition for one-sample z-procedures
At least 10 successes and at least 10 failures ....................................................................... 2-16
validity conditions for chi-square test
Each cell of the two-way table must have at least 10 observations ............................. 8-22, 8-26
validity conditions for one-sample t-test
A sample size of at least 20 or the distribution of the quantitative variable is not highly skewed ..
3-31, 3-39
validity conditions for paired t-test
Sample size of pairs is at least 20 OR the differences follow a normal distribution ................ 7-27
validity conditions for two-sample z-procedures
At least 10 observations in each of the cells of the 2x2 table................................................. 5-35
xbar
The sample average of a quantitative variable ........................................................................ 3-5
xbard
Observed sample average of the differences .......................................................................... 7-5
z-statistic
z-statistic is synonymous with standardized sample proportion, also called the standardized
statistic. ................................................................................................................................. 1-55
π (pi)
Greek symbol used for the unknown underlying process probability or true population
proportion. A parameter. ....................................................................................................... 1-19