Stats 156 – Terms
Chapter 1
Population = the entire collection of individuals or objects about which information is desired
Sample = a subset of the population selected for study in some way
Categorical Data = data that is qualitative in nature
Numerical Data = data that is quantitative in nature
Continuous = the data is made up of some interval on the number line
Discrete = the data is a collection of separated points on the number line
Chapter 2
Bias = tendency for samples to differ from corresponding population
Selection Bias = some part of the population is systematically excluded from the sample
Measurement Bias = the method of observation produces values that differ from the population
Non-Response Bias = responses are not obtained from all members of the sample
** Selection Bias is the most common
Extraneous Factor = a factor not of interest in the current study, but is thought to affect the response
variable in some way
Confounded Factors = two factors whose effects on the response variable cannot be distinguished from
one another in any way
Chapter 3
Relative Frequency = the fraction or proportion of the time that a particular category appears in the data set
Frequency Distribution = a table that displays the possible categories along with the associated
frequencies, relative frequencies, cumulative frequencies, and relative cumulative frequencies
Outlier = an unusually small or large data value
Dot Plot, Stem & Leaf, Histogram
Chapter 4
Mean = average of the observation values
Median = the middle value when the data is arranged sequentially
Mode = the observation value occurring most often
Standard Deviation = a “typical” deviation from the mean
Lower Quartile (Q1) = the median of the lower half of the data
Upper Quartile (Q3) = the median of the upper half of the data
Interquartile Range (IQR) = upper quartile – lower quartile
Outlier = a sample observation that lies more than 1.5(IQR) away from the upper or lower quartile, or an
observation that has |z-score| > 2
Chebyshev’s Rule = for any k > 1, the percentage of observations that are within k standard deviations of the mean is
at least $100\left(1 - \frac{1}{k^2}\right)\%$
So at least 75% are within 2 standard deviations
At least 88.9% are within 3 standard deviations
At least 93.75% are within 4 standard deviations …
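As a quick check of the Chebyshev bound, the sketch below evaluates $100(1 - 1/k^2)$ for a few values of k. The function name `chebyshev_lower_bound` is made up for illustration and is not part of the course material.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is zero or negative, so it says nothing useful.
        raise ValueError("Chebyshev's Rule is only informative for k > 1")
    return 100 * (1 - 1 / k**2)

for k in (2, 3, 4):
    print(k, round(chebyshev_lower_bound(k), 2))
```

The bound holds for any data set, which is why it is much weaker than the Empirical Rule for data that look normal.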
Empirical Rule = If the data is well represented by a normal curve, then 68% of the observations are
within 1 standard deviation of the mean, 95% are within 2, and 99.7% are within 3
Z-Score = how many standard deviations an observation is from the mean
5 Number Summary of Data Set = min, Q1, median, Q3, max
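The quartile, IQR, and outlier definitions of this chapter can be sketched in a few lines of Python. This is only an illustration: the function names and the data are hypothetical, and it assumes that when n is odd the median itself is left out of both halves (a common textbook convention; calculators and software may use other quartile rules).

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves quartiles."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]              # lower half (median excluded when n is odd)
    upper = xs[half + n % 2:]      # upper half (median excluded when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def iqr_outliers(data):
    """Observations more than 1.5(IQR) below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [2, 4, 5, 7, 8, 9, 11, 12, 40]   # made-up observations
print(five_number_summary(data))
print(iqr_outliers(data))
```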
Chapter 5
Residual = difference between observed value and regression model value
(Pearson’s) Correlation Coefficient = measures how closely the data fall on a straight line
Coefficient of Determination = tells what percent of the variation can be explained by the linear
relationship between x and y
Chapters 6 & 7
Chance Experiment = any activity or situation in which there is uncertainty about which of two or more
possible outcomes will result
Probability of Event E (P(E)) = when all outcomes are equally likely, the ratio of the number of possible ways to get E to the
total number of possible outcomes
Properties of Probability
1. For any event E, $0 \le P(E) \le 1$
2. If S is the entire sample space, P(S) = 1
3. If events E and F are disjoint, then $P(E \cup F) = P(E) + P(F)$
otherwise $P(E \cup F) = P(E) + P(F) - P(E \cap F)$
4. For any event E, $P(E^c) = 1 - P(E)$, where $E^c$ is the event that E does not occur
Random Variable = a numerical variable whose value depends on the outcome of a chance experiment
Discrete = possible outcomes are isolated points on a number line
Continuous = possible outcomes are an entire interval on the number line
Probability Distribution (discrete x) = probability of each possible outcome of the random variable x
Probability Distribution (continuous x) = a function f(x) (sometimes called the density function of the
variable x) with the properties that $f(x) \ge 0$ for all x and $\int_{-\infty}^{\infty} f(x)\,dx = 1$
In this case, $P(a \le x \le b) = \int_a^b f(x)\,dx$
Normal Distribution = a probability distribution whose density function is the bell shaped normal curve, determined by its
mean $\mu$ and standard deviation $\sigma$
Standard Normal Distribution = the normal distribution (bell shaped curve) with mean $\mu = 0$ and
standard deviation $\sigma = 1$
*Calculator can find probabilities for normal distributions with mean $\mu$ and standard deviation $\sigma$
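For readers without the calculator, here is a minimal Python sketch of the same kind of normal-probability lookup. The function is named `normalcdf` only to mirror the calculator command used later in these notes; it is an illustration built on the standard library's `statistics.NormalDist`, not the calculator's implementation.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve with mean mu and sd sigma between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

print(round(normalcdf(-1.96, 1.96), 4))       # central area for the standard normal
print(round(normalcdf(85, 115, 100, 15), 4))  # within 1 sd of a mean of 100
```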
Chapter 8
Statistic = any value computed from values in a sample
Sampling Distribution = the probability distribution for a statistic
$\bar{x}$ Sampling
$\bar{x}$ Sampling = for samples of size n from a population, we find $\bar{x}$ for each sample, and the variable $\bar{x}$
becomes the random variable of interest
General Properties of an $\bar{x}$ Sampling Distribution
Let $\bar{x}$ denote the mean of the observations in a random sample of size n from a population having
mean $\mu$ and standard deviation $\sigma$.
1. $\mu_{\bar{x}} = \mu$
2. $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$, so $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
3. When the population is normal, the sampling distribution of $\bar{x}$ is also normal for any size n
4. For large values of n ($n \ge 30$), the sampling distribution of $\bar{x}$ is approximately normal
regardless of whether or not the population itself is normal = Central Limit Theorem
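Properties 1 and 2 can be verified exactly on a tiny example. The sketch below lists every possible sample of size 2 from the faces of a fair die (a made-up population) and assumes sampling with replacement, so the observations are independent. The helper names `mean` and `variance` are defined here for illustration.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population: faces of a fair die
n = 2                             # sample size

def mean(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)

def variance(values):
    """Population-style variance: average squared deviation from the mean."""
    values = list(values)
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# One sample mean for each of the 36 equally likely samples.
sample_means = [mean(s) for s in product(population, repeat=n)]

mu, sigma2 = mean(population), variance(population)
print(mean(sample_means), mu)               # both 7/2
print(variance(sample_means), sigma2 / n)   # both 35/24
```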
Proportion Sampling
S = Success = an individual or object has a specific property that is under investigation
F = Failure = does not have the property
$\pi$ = the proportion of successes in the entire population
General Properties of a Proportion Sampling Distribution
Let p be the proportion of successes in a random sample of size n from a population whose proportion
of S’s is $\pi$. Denote the mean value of p by $\mu_p$ and the standard deviation of p by $\sigma_p$. Then:
1. $\mu_p = \pi$
2. $\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$
3. When n is large and $\pi$ is not too close to 0 or 1, the sampling distribution of p is
approximately normal.
A conservative rule says if $n\pi \ge 10$ and $n(1 - \pi) \ge 10$, then it is safe to use a normal
approximation.
Chapter 9
Point Estimation = a single number that is based on sample data and represents a plausible value of the
characteristic for the entire population
Unbiased Statistic = a statistic whose sampling distribution has a mean equal to the value of the population
characteristic being estimated (otherwise it is called biased)
$\bar{x}$ is an unbiased point estimate of $\mu$
p is an unbiased point estimate of $\pi$
$s^2$ is an unbiased estimate of $\sigma^2$
s is a biased estimate of $\sigma$
When we have several unbiased estimates to choose from, we always want the one having
minimum variance (called the minimum variance estimate).
Confidence Interval = an interval of plausible values for the characteristic being studied; it is constructed
so that, with a chosen degree of confidence, the value of the characteristic will be captured inside the
interval
Confidence Level = the success rate of the method used to construct a confidence interval
Large Sample Confidence Interval for $\pi$
1. Let p be the sample proportion and
2. Let n be the sample size
The confidence interval for $\pi$ is $p \pm z^* \sqrt{\frac{p(1 - p)}{n}}$
Where z* is the z – value corresponding to the desired confidence level:
80% → 1.282
90% → 1.645
95% → 1.96
98% → 2.326
99% → 2.576
z Confidence Interval for $\mu$
Used when the population standard deviation $\sigma$ is known.
The confidence interval for $\mu$ is $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$, where z* is as above
t Distributions
1. The t curve corresponding to any number of degrees of freedom is bell shaped and centered at 0
2. Each t curve is more spread out than the z curve
3. As the # of degrees of freedom increases, the spread of the corresponding t curves decreases
4. As the # of degrees of freedom increases, the corresponding sequence of t curves becomes closer
and closer to the z curve
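The two z intervals from this chapter (for a proportion $\pi$, and for a mean $\mu$ with known $\sigma$) can be sketched in Python as follows. The function names and the numbers in the usage lines are hypothetical. z* is computed from the standard normal curve rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist

def z_star(confidence):
    """Critical value z* leaving central area `confidence` under the z curve."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def proportion_interval(successes, n, confidence=0.95):
    """Large-sample interval p +/- z* sqrt(p(1-p)/n)."""
    p = successes / n
    margin = z_star(confidence) * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def mean_interval_known_sigma(xbar, sigma, n, confidence=0.95):
    """z interval xbar +/- z* sigma/sqrt(n), for a known population sd."""
    margin = z_star(confidence) * sigma / sqrt(n)
    return xbar - margin, xbar + margin

print(round(z_star(0.95), 3))
print(proportion_interval(60, 100))            # made-up data: 60 successes in 100
print(mean_interval_known_sigma(50, 10, 25))   # made-up data
```

A t interval would replace z* with a t critical value for n – 1 df, which the Python standard library does not provide directly.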
t Confidence Interval for $\mu$
Used when the population standard deviation is unknown. We instead use the sample standard
deviation s and a t distribution rather than the normal distribution.
The confidence interval for $\mu$ is $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$, where t* is the critical value from the t curve with n – 1 df
Chapter 10 – 1 Variable Hypothesis Tests
Null Hypothesis (H0) = a claim about a population characteristic that is assumed to be true
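H0: population characteristic = hypothesized value
Alternative Hypothesis (Ha) = a statement that the null hypothesis is not true in some way
Ha: population characteristic > hypothesized value
Ha: population characteristic < hypothesized value
Ha: population characteristic ≠ hypothesized value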
Type I Error = rejecting H0 when it is in fact true
Type II Error = failing to reject H0 when it is in fact not true
Level of Significance = the probability of a type I error (denoted $\alpha$)
Note: the probability of a type II error is denoted $\beta$
For a fixed sample size, as $\alpha$ gets smaller, $\beta$ gets larger. We will focus on controlling the value of $\alpha$.
Test Statistic = the function of sample data on which a conclusion to reject or fail to reject H0 is based
For population proportions, the test statistic is p = sample proportion
For population means, the test statistic is $\bar{x}$ = sample mean
P-Value = the probability that the test statistic takes its observed value (or a more extreme value) assuming H0 is true
The smaller the P-value, the stronger the evidence to reject H0 (we reject H0 if P-value ≤ $\alpha$)
The larger the P-value, the weaker the evidence against H0 (we fail to reject H0 if P-value > $\alpha$)
We compare the P-value to the level of the test ($\alpha$) to determine whether to reject H0 or not
P-Value for a Population Proportion Hypothesis Test
If we have that $np \ge 10$ and $n(1 - p) \ge 10$:
Given H0: $\pi = \pi_0$, a sample proportion p, and $\sigma_p = \sqrt{\frac{\pi_0(1 - \pi_0)}{n}}$
1. Ha: $\pi > \pi_0$
P-value = normalcdf($p$, $10^{10}$, $\pi_0$, $\sigma_p$)
2. Ha: $\pi < \pi_0$
P-value = normalcdf($-10^{10}$, $p$, $\pi_0$, $\sigma_p$)
3. Ha: $\pi \ne \pi_0$
P-value = 2*normalcdf($-10^{10}$, $p$, $\pi_0$, $\sigma_p$) if $p < \pi_0$
or 2*normalcdf($p$, $10^{10}$, $\pi_0$, $\sigma_p$) if $p > \pi_0$
Note: This can be done in the TI-83 using the 1-PropZTest
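The three P-value cases for a proportion can be sketched in Python as below. This is an illustration, not the calculator's 1-PropZTest routine; the function name, the `alternative` labels, and the example counts are all made up.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(successes, n, pi0, alternative="two-sided"):
    """P-value for H0: pi = pi0 using the normal approximation."""
    p = successes / n
    sigma_p = sqrt(pi0 * (1 - pi0) / n)    # sd of p when H0 is true
    dist = NormalDist(pi0, sigma_p)
    lower = dist.cdf(p)                    # area to the left of p
    upper = 1 - lower                      # area to the right of p
    if alternative == "greater":
        return upper
    if alternative == "less":
        return lower
    return 2 * min(lower, upper)           # doubles the tail on p's side of pi0

print(round(one_prop_z_test(60, 100, 0.5, "greater"), 4))
print(round(one_prop_z_test(60, 100, 0.5), 4))
```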
5 Pieces of a Hypothesis Test
1. H0
2. Ha
3. Test statistic (p = # or x = #)
4. P-Value
5. Conclusion
One-Sample t Test (or z Test) for a Population Mean
If we have that $n \ge 30$:
Given H0: $\mu = \mu_0$, a sample mean $\bar{x}$, and $\sigma_{\bar{x}} = \frac{s}{\sqrt{n}}$, compute $t^* = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}}$ and use the t-curve with n – 1 df
1. Ha: $\mu > \mu_0$
P-value = tcdf($t^*$, $10^{10}$, df)
2. Ha: $\mu < \mu_0$
P-value = tcdf($-10^{10}$, $t^*$, df)
3. Ha: $\mu \ne \mu_0$
P-value = 2*tcdf($-10^{10}$, $t^*$, df) if $\bar{x} < \mu_0$
or 2*tcdf($t^*$, $10^{10}$, df) if $\bar{x} > \mu_0$
** In Calculator = ZTest or TTest
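The one-sample t test can be sketched in Python as follows. The Python standard library has no t distribution, so this sketch assumes it is acceptable to get tail areas by integrating the t density numerically with Simpson's rule. That is a stand-in for the calculator's tcdf, not its actual method, and all function names and example numbers are hypothetical.

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(t, df):
    """Density of the t curve with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + t * t / df))

def t_upper_tail(t, df, steps=2000):
    """Area to the right of t, using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    right_of_abs = 0.5 - total * h / 3     # the curve is symmetric about 0
    return right_of_abs if t >= 0 else 1 - right_of_abs

def one_sample_t_test(xbar, s, n, mu0, alternative="two-sided"):
    """Return (t, df, P-value) for H0: mu = mu0."""
    t = (xbar - mu0) / (s / sqrt(n))
    df = n - 1
    if alternative == "greater":
        p = t_upper_tail(t, df)
    elif alternative == "less":
        p = 1 - t_upper_tail(t, df)
    else:
        p = 2 * t_upper_tail(abs(t), df)
    return t, df, p

print(one_sample_t_test(52, 10, 36, 50))   # made-up summary statistics
```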
Chapter 11 – 2 Variable Hypothesis Tests
Independent Samples = two samples in which the selection of one sample in no way affects the selection
of the other sample
Paired Samples = observations from the first sample are in some meaningful way paired with
observations from the second sample
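Sampling Distribution for $\bar{x}_1 - \bar{x}_2$
Suppose two independent quantities, A and B, are normal. Then A – B is normal with mean $\mu_{A-B} = \mu_A - \mu_B$ and
standard deviation $\sigma_{A-B} = \sqrt{\sigma_A^2 + \sigma_B^2}$
So the distribution of $\bar{x}_1 - \bar{x}_2$ is normal with mean $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$ and
standard deviation $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$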
Two Sample t Test for 2 Population Means
Assuming n1 and n2 are both ≥ 30 (or both populations are normal) and the samples were selected
independently:
H0: $\mu_1 - \mu_2 = 0$
Ha and P-Value: One of:
$\mu_1 > \mu_2$: area to the right of computed t under the t curve
$\mu_1 < \mu_2$: area to the left of computed t under the t curve
$\mu_1 \ne \mu_2$: sum of areas to the right of |computed t| and left of –|computed t|
Test Statistic: $t = \dfrac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, where $\mu_1 - \mu_2$ is the hypothesized difference (0 under H0)
and df = $\dfrac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$ rounded down
** In Calculator = 2-SampZTest or 2-SampTTest
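The df formula is easy to mistype, so here is a small Python sketch of the two-sample test statistic and its df computed from summary statistics. The function name and the example values are hypothetical. The P-value would still come from a t curve with the returned df.

```python
from math import floor, sqrt

def two_sample_t(xbar1, s1, n1, xbar2, s2, n2, hypothesized_diff=0.0):
    """Return (t, df) for the two-sample t test, with df rounded down."""
    v1, v2 = s1**2 / n1, s2**2 / n2        # s1^2/n1 and s2^2/n2
    t = (xbar1 - xbar2 - hypothesized_diff) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, floor(df)

t, df = two_sample_t(100, 10, 40, 95, 15, 50)   # made-up summary statistics
print(round(t, 4), df)
```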
Paired t Test for Comparing 2 Population Means
Assuming samples are paired, the n sample differences can be viewed as a random sample from a
population of differences, and n >= 30 or both populations are normal:
H0: $\mu_d = \mu_1 - \mu_2$ = hypothesized value
Ha and P-Value: One of:
$\mu_d$ > hypothesized value: area to the right of calculated t under t curve with n – 1 df
$\mu_d$ < hypothesized value: area to the left of calculated t under t curve with n – 1 df
$\mu_d$ ≠ hypothesized value: sum of areas to the right of |t| and left of –|t|
Test Statistic: $t = \dfrac{\bar{x}_d - \text{hypothesized value}}{s_d / \sqrt{n}}$ and df = n – 1
** In Calculator = ZTest or TTest (since we are back to one random variable)
Large Sample z Test for 2 Population Proportions
Assuming independent samples, $n_1 p_1 \ge 10$, $n_1(1 - p_1) \ge 10$, $n_2 p_2 \ge 10$, and $n_2(1 - p_2) \ge 10$:
H0: $\pi_1 - \pi_2 = 0$
Ha: One of
$\pi_1 - \pi_2 > 0$
$\pi_1 - \pi_2 < 0$
$\pi_1 - \pi_2 \ne 0$
Test Statistic: $z = \dfrac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}}$ where $p_c = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
P-Value: Upper, lower, or two tailed area under the z curve (just as prior proportion tests)
** In Calculator = 2-PropZTest
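A minimal Python sketch of the two-proportion z test, using the combined proportion $p_c$ as defined above. The function name, the `alternative` labels, and the counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """Return (z, P-value) for H0: pi1 - pi2 = 0."""
    p1, p2 = x1 / n1, x2 / n2
    pc = (n1 * p1 + n2 * p2) / (n1 + n2)   # combined sample proportion
    z = (p1 - p2) / sqrt(pc * (1 - pc) / n1 + pc * (1 - pc) / n2)
    lower = NormalDist().cdf(z)            # area to the left of z
    upper = 1 - lower                      # area to the right of z
    if alternative == "greater":
        return z, upper
    if alternative == "less":
        return z, lower
    return z, 2 * min(lower, upper)

z, p_value = two_prop_z_test(60, 100, 50, 100)
print(round(z, 4), round(p_value, 4))
```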
Distribution Free = procedures that do not require any overly specific assumptions about the population
distributions
Rank Sum Test
Assuming: The samples are randomly collected or the two treatments are randomly assigned to
individuals, and the two population distributions have the same shape and spread (but not necessarily
normal)
H0: $\mu_1 - \mu_2 = 0$
Ha: One of:
$\mu_1 - \mu_2 > 0$ Upper tail test
$\mu_1 - \mu_2 < 0$ Lower tail test
$\mu_1 - \mu_2 \ne 0$ Two tailed test
Test Statistic:
Rank sum = sum of the ranks assigned to the observations in the first sample
P-Value: Found from table on page 817 of Peck, Olsen, Devore
**Rank:
1. List all observations (from both samples) from smallest to largest.
2. Rank them: smallest = 1, next smallest = 2, …
Ties: rank each as the average of the positions in the list
i.e. if 50 was both the 4th and 5th observation, each 50 would have rank 4.5
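The ranking procedure, including the averaging of tied ranks, can be sketched in Python as follows. The function names and the two samples are hypothetical. The samples were chosen so that 50 is both the 4th and 5th observation, as in the note above.

```python
def average_ranks(values):
    """Map each value to its rank (smallest = 1), averaging positions for ties."""
    positions = {}
    for position, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(position)
    return {v: sum(p) / len(p) for v, p in positions.items()}

def rank_sum(sample1, sample2):
    """Sum of the ranks assigned to the observations in the first sample."""
    ranks = average_ranks(list(sample1) + list(sample2))
    return sum(ranks[v] for v in sample1)

sample1 = [42, 50, 61]        # made-up data
sample2 = [38, 45, 50, 70]
print(average_ranks(sample1 + sample2)[50])
print(rank_sum(sample1, sample2))
```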
Chapter 12 – $X^2$ Hypothesis Tests
Goodness of Fit Statistic = $X^2$ = a quantitative measure of the extent to which the observed counts differ
from those expected when the null hypothesis is true
$X^2 = \sum_{\text{all cells}} \dfrac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
If all expected counts are ≥ 5, then $X^2$ has approximately a $\chi^2$ (chi – squared) probability distribution.
Goodness of Fit Test
H0: $\pi_1$ = hypothesized value 1, $\pi_2$ = hypothesized value 2, $\pi_3$ = hypothesized value 3, …, $\pi_k$ = hypothesized value k
Ha: H0 is not true
Test Statistic: $X^2$ as defined above
P-Value: Probability of getting $X^2$ or larger in a chi-squared distribution with df = k – 1
**I wrote program Chi2 in Calculator to do this
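The calculator program mentioned above is not shown in these notes. As a rough stand-in, the Python sketch below computes the goodness-of-fit statistic and its df from observed counts and hypothesized proportions. The function name and the example counts are hypothetical, and the P-value would still be looked up under the chi-squared curve.

```python
def goodness_of_fit(observed, hypothesized_proportions):
    """Return (X2, df, expected counts) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in hypothesized_proportions]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return x2, len(observed) - 1, expected

x2, df, expected = goodness_of_fit([30, 20, 50], [0.25, 0.25, 0.5])
print(x2, df, expected)
```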
Two Way Frequency Table (Contingency Table) = rectangular table (matrix) that consists of a row for
each possible value of x and a column for each possible value of y where x and y are two random
categorical variables and each entry in the matrix is the frequency count (cell count) of that particular
(x, y) combination
Marginal Totals = sum of a row or a column
Grand Total = sum of all entries
Expected Cell Count = what would be expected when there is no difference between the groups or
experiments under study
Expected cell count = $\dfrac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$
Comparing Two or More Populations Using X2 Statistic
Assuming the samples are chosen independently and the sample size is large (each expected count is at
least 5)
H0: true category proportions are the same for all populations (population homogeneity)
Ha: true category proportions are not the same for all populations
Test Stat: $X^2 = \sum_{\text{all cells}} \dfrac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
P-Value = area to the right of $X^2$ under the chi-squared curve with df = (#rows – 1)(#columns – 1)
** This same test can be used to check the independence of 2 categorical variables.
In Calculator: Put data into a matrix, and then use X2-Test…
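For a two-way table, the expected counts, the statistic, and the df can be sketched in Python as below. The table is entered as a list of rows, much like the calculator's matrix. The function name and the counts are made up.

```python
def chi_square_two_way(table):
    """Return (X2, df, expected counts) for a two-way frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # expected = (row marginal total)(column marginal total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    x2 = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(table, expected)
             for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df, expected

x2, df, expected = chi_square_two_way([[20, 30], [30, 20]])
print(x2, df, expected)
```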
Chapter 13 – Regression Analysis
Deterministic Relationship = one in which the value of y is completely determined by the value of x
Probabilistic Model = a description of the relationship between two variables x and y that are not
deterministically related
Additive Probabilistic Model: y = deterministic function of x + random deviation (called e)
Simple Linear Regression = assumes that there is a line with y-intercept $\alpha$ and slope $\beta$; this line is called
the population regression line
$y = \alpha + \beta x + e$
Notes:
1. e has normal distribution
2. e has mean 0 and standard deviation $\sigma$ for any particular x – value
3. the random deviations (e1, e2, …, en) associated with different observations are
independent of one another
Estimating the Regression Line
For a collection of points (x, y), we find the regression line $\hat{y} = a + bx$ where a is the point estimate of $\alpha$ and b
is the point estimate of $\beta$.
a and b are “chosen” so that $\sum (y - \hat{y})^2$ is as small as possible (this is a calculus problem – see
formula sheet for more details); this is called the least squares regression line and is the line given by
the calculator when a regression is done using it
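The formula sheet referred to above is not part of these notes. As a sketch, the Python function below uses the standard closed-form least squares solution, $b = \sum (x - \bar{x})(y - \bar{y}) / \sum (x - \bar{x})^2$ and $a = \bar{y} - b\bar{x}$, which may be written differently on the sheet. The function name and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx                 # slope estimate
    a = y_bar - b * x_bar         # intercept estimate
    return a, b

a, b = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])   # made-up points
print(round(a, 4), round(b, 4))
```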
Coefficient of Determination = the proportion of observed y variation that can be explained by the
model relationship (see formula sheet for calculation)
Sample Correlation Coefficient (r) = a measurement of how strongly the x and y values in a sample are
linearly related to one another
Population Correlation Coefficient ($\rho$) = a measurement of how strongly the x and y values in the entire
population are linearly related to one another
***We use r to make inferences about $\rho$.
Bivariate Normal Distribution
• for any fixed value of x, the distribution of the associated y – values is normal
• for any fixed value of y, the distribution of the x – values is normal
Test for Independence of Two Numerical Variables in a Bivariate normal population
Assuming r is the correlation coefficient for a random sample from a bivariate normal population:
H0: $\rho = 0$ (variables are independent)
Ha: One of $\rho > 0$, $\rho < 0$, $\rho \ne 0$ (variables are not independent)
Test Statistic: $t = \dfrac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
P-Value: Area (to the right, left, or both ends) under the t curve with df= n – 2
In Calculator: use LinRegTTest…
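The test statistic can also be computed by hand. The Python sketch below first computes r with the usual Pearson formula (an assumption here, since the notes defer the calculation to the formula sheet) and then the t statistic with n – 2 df. The function name and the data are made up.

```python
from math import sqrt

def correlation_t_test(xs, ys):
    """Return (r, t, df) for testing H0: rho = 0."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)                  # sample correlation coefficient
    t = r / sqrt((1 - r**2) / (n - 2))
    return r, t, n - 2

r, t, df = correlation_t_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4), round(t, 4), df)
```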
Chapter 14 – Multivariable Regression
General Additive Multiple Regression Model
Relates a dependent variable y to k predictor variables x1, x2, …, xk by the model equation
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + e$ where the random deviation e is assumed to be normally
distributed with mean 0 and variance $\sigma^2$ for any particular values of the predictor variables (which
implies that for fixed values of the predictor variables, y has normal distribution with variance $\sigma^2$)
Population Regression Coefficients
The $\beta$’s in the above regression model
Each $\beta_i$ represents how the mean y – value would change if the corresponding xi is increased by 1 unit and
all other predictor variables are held constant
Population Regression Function
$\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$ = the mean y value for fixed values of the predictor variables
Chapter 15
ANOVA = analysis of variance – checking whether the means of more than 2 populations are identical
Single Factor Analysis of Variance = comparison of k population or treatment means $\mu_1, \mu_2, \dots, \mu_k$
ANOVA Notation
N = n1 + n2 + … + nk = total number of observations in the data set
T = $n_1\bar{x}_1 + n_2\bar{x}_2 + \dots + n_k\bar{x}_k$ = grand total = sum of all observations
$\bar{\bar{x}}$ = grand mean = $\frac{T}{N}$
Treatment Sum of Squares (SSTr)
A measurement of the amount of variation from group to group
$SSTr = n_1(\bar{x}_1 - \bar{\bar{x}})^2 + n_2(\bar{x}_2 - \bar{\bar{x}})^2 + \dots + n_k(\bar{x}_k - \bar{\bar{x}})^2$ (This has df = k – 1)
Error Sum of Squares (SSE)
A measurement of the amount of variation within each group
$SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2 = \sum_{i=1}^{k} \sum_{x \text{ in group } i} (x - \bar{x}_i)^2$ (This has df = N – k)
Mean Squares = a sum of squares ÷ its df
$MSTr = \frac{SSTr}{k - 1}$ = mean square for treatments
$MSE = \frac{SSE}{N - k}$ = mean square for error
Single-Factor ANOVA Test
Essentially checking whether the variation within the group is the same as the variation from group to
group – if they are “the same” then it is likely that H0 is true
Assuming
1. each of the k populations is normal
2. $\sigma_1 = \sigma_2 = \dots = \sigma_k$
(good enough if largest sample standard deviation is ≤ 2(smallest sample standard deviation))
3. observations in a given sample are independent of one another
4. data is collected in a random manner
H0: $\mu_1 = \mu_2 = \dots = \mu_k$
Ha: at least 2 of the $\mu$’s are different
ANOVA Table
Source    df       SS      MS      F
Factor    k – 1    SSTr    MSTr    MSTr/MSE
Error     N – k    SSE     MSE
Total     N – 1    SSTo
P-Value = area of the upper tail of the F curve with df1 = k – 1 and df2 = N – k
Total Sum of Squares (SSTo)
$SSTo = \sum_{\text{all observations}} (x - \bar{\bar{x}})^2$
Fundamental Identity for a Single-Factor ANOVA
SSTo = SSTr + SSE
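The ANOVA quantities and the fundamental identity can be checked numerically with a short Python sketch. The function name, the dictionary keys, and the three small groups are hypothetical. The P-value would still come from the upper tail of the F curve with the returned df.

```python
from statistics import mean, variance

def single_factor_anova(groups):
    """Return the sums of squares, mean squares, F, and df for k groups."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    sse = sum((len(g) - 1) * variance(g) for g in groups)   # variance(g) is s_i^2
    ssto = sum((x - grand_mean) ** 2 for g in groups for x in g)
    mstr, mse = sstr / (k - 1), sse / (N - k)
    return {"SSTr": sstr, "SSE": sse, "SSTo": ssto,
            "MSTr": mstr, "MSE": mse, "F": mstr / mse, "df": (k - 1, N - k)}

result = single_factor_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])   # made-up data
print(result)
```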