Course Notes - Miles Finney

Course Notes
Introduction to Statistics


Statistics is a discipline defining a set of procedures used to collect and interpret
numerical data.
The discipline of statistics serves two purposes:
o Statistical procedures can be used to describe the relevant characteristics
(dispersion, central tendency) of a body of data. This is called Descriptive
Statistics .
o Statistical procedures can be utilized to help us make inferences or
predictions about a whole population based on information from a sample
of the population. This is called Inferential Statistics.
Measurement and Sampling
A population consists of all the observations with a given set of characteristics. It
is all the possible observations within the group the researcher is studying.
A sample is a portion of the population.
In inferential statistics, a sample is selected to represent the population studied.
A sample will, on average, be representative of the population if the procedure
used to select the sample is unbiased.
An unbiased sampling procedure is one in which each observation in the
population has an equal chance of being chosen for the sample.
The possible biasedness of a sampling procedure depends partially on the exact
definition of the population.
o For example, if a researcher were studying CSULA male students, it would
not be biased to select a sample from among only males.
o If the researcher were studying CSULA students in general, then it would
be biased to select only males.
Random sampling is an unbiased procedure but there are other unbiased sampling
procedures.
Stratified sampling, in which the sample is purposely selected so that certain
characteristics of the sample match those of the population, is not completely
random but nevertheless can be unbiased.
Distribution and the Visual Display of Data


A frequency distribution illustrates the number of observations in a data set that
fall into various classes of the data.
A category (or bin) is an interval of the data.
o Categories must be mutually exclusive. No observation in the data should
fall within more than one category.
o Categories must also be exhaustive. Each observation in the data must fall
within a class.



The number of observations falling into the various classes in the distribution is
called frequency (denoted fi).
A relative frequency distribution reveals the percentage of observations that fall
into the various classes of data.
A histogram is a graphical representation of a frequency or relative frequency
distribution.
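As a rough illustration, the binning described above can be sketched in Python; the data values and bin edges here are made up for illustration.

```python
# Made-up data and bins: mutually exclusive, exhaustive intervals.
data = [62, 67, 71, 74, 78, 78, 81, 85, 88, 93]
bins = [(60, 70), (70, 80), (80, 90), (90, 100)]

# Frequency: count of observations falling in each bin [lo, hi).
freq = {b: sum(1 for x in data if b[0] <= x < b[1]) for b in bins}
# Relative frequency: share of observations in each bin.
rel_freq = {b: f / len(data) for b, f in freq.items()}

for (lo, hi), f in freq.items():
    print(f"[{lo},{hi}): frequency={f}, relative frequency={rel_freq[(lo, hi)]:.2f}")
```

A histogram is simply a bar chart of these counts (or shares) over the bins.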
Summary Description of Data
The summary information we usually want to know about a data set is the center
of the data (measure of centrality) and how dispersed the data set is (measure of
dispersion).
A parameter is a numerical characteristic of the population.
A sample statistic is a numerical characteristic of the sample.
There are three measures of centrality: mean, median and mode.
o The mean (arithmetic average) is the most common measure of centrality.
o The mean of the population is denoted µ; the mean of the sample is
denoted X̄.
o The median is the middle value of the data ordered from lowest to highest.
o The value in a data set that occurs with the greatest frequency is defined as
the mode.
o The value of the mean is sensitive to the outliers in a data set whereas the
median is not.
o The median will equal the mean if the distribution of the data is
symmetric.
A symmetric distribution is one in which the side of the distribution to the right of
the mean is a mirror image of the left portion.
If a distribution is skewed to the right, outlier values in the data much larger than
the mean are pulling the value of the mean above the median.
If a distribution is skewed to the left, outlier values in the data much smaller than
the mean are pulling the value of the mean below the median.
There are three measures of dispersion: range, variance and standard deviation.
o Variance and standard deviation are both measures of how a data set
varies with respect to its mean.
o The variance of population data is: σ² = Σᵢ (xᵢ − µ)² / N
o The variance of sample data is: S² = Σᵢ (xᵢ − x̄)² / (n − 1)
o Standard deviation for either the population or the sample equals the
square root of the variance.
o The larger the variance or standard deviation, the greater the variation in
the data around its mean.


The coefficient of variation expresses standard deviation as a percentage of the
mean: CV = (S / x̄) × 100%
The coefficient of variation is useful in comparing the variation of data sets that
have different means.
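A minimal sketch of these summary measures, using Python's standard-library `statistics` module on a made-up sample:

```python
import statistics

# Made-up sample data.
sample = [4, 8, 6, 5, 3, 7, 9, 6]

mean = statistics.mean(sample)          # x̄, the arithmetic average
median = statistics.median(sample)      # middle value of the ordered data
mode = statistics.mode(sample)          # most frequent value
s2 = statistics.variance(sample)        # sample variance, divides by n - 1
s = statistics.stdev(sample)            # sample standard deviation
cv = s / mean * 100                     # coefficient of variation, in percent

pop_var = statistics.pvariance(sample)  # population variance, divides by N
```

Note the module distinguishes `variance` (sample, n − 1 denominator) from `pvariance` (population, N denominator), matching the two formulas above.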
The Normal Distribution
A continuous distribution is represented by a smooth curve.
Probability is represented by the area under the curve.
The normal distribution is the most common continuous distribution in
statistics. Many variables in the social and natural world are normally distributed.
The normal distribution is a formula that draws a family of symmetric curves.
Each distinct normal curve has its own mean and variance.
Characteristics of the normal distribution:
o The normal curve is symmetric around the mean, µ.
o The normal curve extends from negative to positive infinity.
o The total area under the normal curve sums to one.
o The mean, median and mode of the distribution equal one another.
o Empirical Rule: If a variable follows a normal distribution,
o 68% of its observations will be within one standard deviation of its
mean.
o 95% of its observations will be within two standard deviations.
o 99.7% of its observations will be within three standard deviations.
The standard normal, or Z-distribution is a specific normal curve with a mean µ=0
and variance σ2=1.
The value of the variable Z represents standard deviations from the mean.
If the variable X represents individual observations in a population, the formula
Z = (x − µ) / σ transforms the variable into Z.
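A short sketch of standardization and the Empirical Rule in Python; the population mean and standard deviation below are assumed values.

```python
from statistics import NormalDist

# Assumed population parameters and one observation.
mu, sigma = 100, 15
x = 130
z = (x - mu) / sigma        # x lies 2 standard deviations above the mean

# Empirical Rule check: area under the standard normal within ±2.
std_normal = NormalDist(mu=0, sigma=1)
within_two = std_normal.cdf(2) - std_normal.cdf(-2)   # close to 0.95
```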
The Concept of Probability
A random experiment is any activity whose outcome cannot be predicted with
certainty (for example a coin toss).
Each possible outcome of an experiment is called a basic outcome.
An event is a collection of basic outcomes that share some characteristic.
o For example, if the experiment consisted of randomly selecting a student
in class, each student would represent a basic outcome whereas left-handed students would exemplify an event.
An event (A) composed of three basic outcomes is denoted A={O1,O2,O3 }.
The probability of event A occurring, P(A), can be assigned using different
approaches.
o Relative Frequency Approach. The experiment may be repeated n
times and fA, the frequency of event A, observed. The relative
frequency, fA/n, can be used to approximate probability.
o Equally Likely (or Theoretical) Approach. If each basic outcome is
equally likely to occur, the probability of event A can be calculated as the
sum of the chances of its basic outcomes.
The Law of Large Numbers states that if an experiment is repeated through
many trials, the proportion of trials in which event A occurs will be close to the
probability P(A). The larger the number of trials the closer the proportion should
be to P(A).
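The Law of Large Numbers can be illustrated with a simulated fair coin; the trial counts and the seed below are arbitrary choices.

```python
import random

# Simulate fair-coin tosses: the proportion of heads should approach
# P(heads) = 0.5 as the number of trials grows.
random.seed(1)  # arbitrary seed so the run is reproducible

for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)
```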
A union of two events is composed of all those basic outcomes that belong to at
least one of the events.
An intersection of events is composed of those basic outcomes that fall in both
events simultaneously.
Conditional probability is the probability of an event occurring conditional on
another event having already arisen.
o Suppose A={females} and B={right-handed people}.
o A ∪ B, the union of the events, consists of all those who are right-handed,
female or both.
o A ∩ B, the intersection of the two events, consists of right-handed females.
o P(A|B), the probability of event A conditional on event B, is the
probability of being female conditional on being right-handed.
o In an experiment that selects students, P(A|B) would be the probability of
selecting a female if we chose only from right-handed students.
The formula to calculate the probability of the union of the events A and B:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
The formula for conditional probability: P(A|B) = P(A ∩ B) / P(B).
If events A and B are independent then the likelihood of one event occurring is
not a function of the other event.
If event A is independent of B, then P(A|B) = P(A).
o In the example, under independence, our chance of selecting a female is
not altered if we condition our selection to those who are right handed.
If the two events are independent, the probability of the intersection of events A
and B is calculated as: P(A ∩ B) = P(A) × P(B)
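A small sketch of these rules in Python; the class counts below are hypothetical.

```python
# Hypothetical class of 40 students:
# 24 females (event A), 36 right-handed (event B), 22 right-handed females.
total = 40
n_A, n_B, n_AB = 24, 36, 22

P_A, P_B = n_A / total, n_B / total
P_AB = n_AB / total                  # P(A ∩ B)
P_union = P_A + P_B - P_AB           # P(A ∪ B)
P_A_given_B = P_AB / P_B             # P(A | B)

# A and B are independent only if P(A | B) equals P(A).
independent = abs(P_A_given_B - P_A) < 1e-9
```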
Discrete Probability Distribution


A Discrete Probability Distribution assigns probability to the
possible values of the discrete variable X.
The probabilities that make up any probability distribution must
sum to 1 (100%).
The notation expressing the probability that the variable X equals
its specific value xi is P(X=xi).
The variable X is considered discrete if we are able to observe and
count each of the different values of the variable.
o For example, student attendance in a statistics class over
the course of a quarter would represent a discrete
variable. The different values of the variable would be
observable and countable.
The expected value, or mean, of the variable X can be calculated
using the information provided by its distribution.
The formula for the expected value of the discrete variable X
is E(X) = µx = Σᵢ xᵢ · P(X = xᵢ).


The calculation of the mean for the discrete variable X differs from
the simple formula for the average because we use the probabilities
given by the distribution rather than the individual observed values
of X directly in the calculation.
The formula for the variance of the discrete variable X is
σ² = Σᵢ (xᵢ − µ)² · P(X = xᵢ).


The standard deviation σ is the square root of the variance.
Variance and standard deviation are indexes measuring the degree
of variation in X.
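A minimal Python sketch of these formulas; the distribution (values and probabilities) below is made up.

```python
# Made-up discrete distribution: value -> probability.
pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

# The probabilities of any distribution must sum to 1.
assert abs(sum(pmf.values()) - 1.0) < 1e-9

mu = sum(x * p for x, p in pmf.items())                # E(X)
var = sum((x - mu) ** 2 * p for x, p in pmf.items())   # σ²
sd = var ** 0.5                                        # σ
```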
From Samples to Population
Parameters are numerical characteristics of the population.
The numerical characteristics of a sample are called sample statistics.
o The mean, variance and standard deviation of the population are
denoted, respectively: µ, σ², σ.
o The mean, variance and standard deviation of the sample are
denoted, respectively: X̄, S², S.
Sample statistics serve as estimates of the population parameters.
If the sample is drawn from the population in an unbiased manner, the
sample mean, X̄, is an unbiased estimate of the population mean, µ.
Unbiasedness means that, on average, the statistic X̄ will equal the
population parameter µ.
In most cases there will be some difference between the sample statistic
and population parameter. This difference is called Sampling Error.
Characteristics of X̄:
o X̄ is a variable. It is calculated from a sample, and different
samples taken from a population will typically generate different
values of X̄.
o X̄ follows a sampling distribution, which assigns probability to the
different possible values of the variable.
o The expected value of the sample mean is the population mean:
E(X̄) = µ.
o The variance of X̄ is σ²/n and the standard deviation is σ/√n,
where n is the size of the sample taken to calculate X̄.
The variance and standard deviation of X̄ can be estimated from sample
data, in which case the calculated variance and standard deviation would
be S²/n and S/√n.
The variance and standard deviation of X̄ decrease as sample size
increases. This is seen from the above formulas for the variance and
standard deviation, in which n is in the denominator.
The distribution of X can be approximated by the normal curve (with a
specific mean and variance) if the sample size is at least thirty
observations. This holds regardless of the distribution of the population
that is sampled.
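These properties of the sample mean can be sketched with a small Python simulation, using a deliberately non-normal uniform population; the sample size, number of samples, and seed below are arbitrary.

```python
import random
import statistics

# Draw many samples from a Uniform(0, 1) population (mean 0.5) and compare
# the spread of the sample means with sigma / sqrt(n).
random.seed(7)                 # arbitrary seed for reproducibility
n = 36                         # sample size
sigma = (1 / 12) ** 0.5        # std dev of Uniform(0, 1)

means = [statistics.mean(random.random() for _ in range(n))
         for _ in range(5_000)]

print(statistics.mean(means))  # close to the population mean 0.5
print(statistics.stdev(means)) # close to sigma / sqrt(36) ≈ 0.048
```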
A population proportion, denoted P, is the percentage of observations
within a population that have a specific characteristic.
A sample proportion, denoted p̂, is the percentage of observations within a
sample that have a specific characteristic.
p̂, calculated from a sample, is a variable with the following
characteristics:
o The expected value of the sample proportion is the population
proportion: E(p̂) = P.
o The variance of p̂ is Pq/n and the standard deviation is √(Pq/n),
where n is sample size and q = 1 − P.
o If nP ≥ 5 and nq ≥ 5, the distribution of p̂ can be approximated by the
normal distribution with the relevant expected value and variance.
Interval Estimation of Population Mean and Proportion



X̄ is a point estimate for the population parameter µ.
The point estimate for µ does not utilize information provided by the
standard deviation of X̄ on the possible magnitude of sampling error.
A confidence interval for µ utilizes this information by generating an
interval around X̄ in which there is a 100(1-α)% probability that µ is
within the interval.
1-α is the level of confidence of the interval (α is the level of
significance), which is set by the researcher. It gives the probability
that the population parameter will fall within the constructed interval.

The formula that generates the confidence interval for µ around our
sample point estimate X̄ is: X̄ ± Z_{α/2} · σ/√n, where Z_{α/2} is the
value in the standard normal distribution for which P(Z > Z_{α/2}) = α/2,
and σ/√n is the standard deviation of X̄.
o The use of σ in the formula implies that we know the value of the
population variance.
o We are also to assume that the population that's sampled is
normally distributed.
o If these requirements do not hold, we can still use the Z-distribution
to estimate the confidence interval at a given level of
significance if our sample size is at least thirty observations.
o In this case we would be estimating σ with the sample standard
deviation, S.
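A minimal Python sketch of the Z-based interval; the sample summary numbers below are assumed for illustration.

```python
from statistics import NormalDist

# Assumed sample summary: n ≥ 30, so S may stand in for σ.
n, x_bar, s = 49, 72.0, 14.0
alpha = 0.05                                # 95% level of confidence

z = NormalDist().inv_cdf(1 - alpha / 2)     # Z_{α/2}, about 1.96
margin = z * s / n ** 0.5                   # Z_{α/2} · S/√n
ci = (x_bar - margin, x_bar + margin)
```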
There is a tradeoff made between interval size and level of confidence.
o The greater the level of confidence (1-α), the larger the size of the
interval calculated around X̄.
o The researcher wants the precision of a small interval with a large
level of confidence.
p̂ is a point estimate for the population parameter P, the population
proportion.
The formula that generates the confidence interval for P is:
p̂ ± Z_{α/2} · √(p̂q̂/n), where q̂ equals 1 − p̂.
t-Distribution


If the population is normal but the variance is unknown and the sample size is less
than thirty, confidence intervals may be constructed using the t-distribution.
o Characteristics of the t-distribution:
o As in the case of the normal distribution, the t-distribution is a
formula that draws a family of symmetric curves.
o The mean of t is 0.
o The variance of the t-distribution is always greater than one.
o The variance of t is calculated as ν/(ν − 2) (for ν > 2), where ν
equals n-1 and is called the degrees of freedom.
The formula to calculate a confidence interval for µ at the 1-α level of
confidence is: X̄ ± t_{α/2,ν} · S/√n, where t_{α/2,ν} is the t-value for
which P(t > t_{α/2,ν}) = α/2. S is an estimate for σ.
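A sketch of the t-based interval in Python; the sample numbers are assumed, and the critical value t_{0.025, 9} ≈ 2.262 is taken from a t-table rather than computed (computing it would require a package such as SciPy).

```python
# Assumed sample summary: normal population, σ unknown, n < 30.
n, x_bar, s = 10, 5.4, 1.2
t_crit = 2.262                    # t_{α/2, ν} with α = 0.05, ν = n - 1 = 9

margin = t_crit * s / n ** 0.5    # t_{α/2, ν} · S/√n
ci = (x_bar - margin, x_bar + margin)
```

For the same α, the t critical value exceeds the Z value (2.262 vs 1.96), so the interval is wider, reflecting the extra uncertainty from estimating σ.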
Hypothesis Testing
A hypothesis test involves testing an idea the researcher has about the
value of a population parameter.
A hypothesis test always involves two competing ideas as to the value of
the population parameter. One hypothesis is placed within the null, H0, the
other is placed within the alternative, H1.
Going into the hypothesis test, the assumed value of the population
parameter is that which is placed within H0.
The hypothesis placed within H0 is considered in an advantaged position
since, normally, only if sample evidence strongly suggests H0 is incorrect
will the researcher reject H0.
In the formal test, if the test statistic falls within a predetermined range of
values (the critical or rejection region) the null hypothesis is rejected.
The alternative hypothesis determines whether the test will be a one tailed
or two tailed test. It will determine on which side the rejection region will
be for the one tailed test.
A type I error occurs if the researcher incorrectly rejects the null
hypothesis. A type II error occurs when the researcher incorrectly fails to
reject H0.
The researcher controls the probability of making a type I error by setting
the level of significance of the test.
Course Notes
The Normal Distribution
A distribution assigns probabilities to possible values of a variable.
The area under the curve utilized as a distribution represents probability.
The total area under any curve that serves as a distribution must sum to one.
The normal distribution is actually a collection of curves all derived from the
same formula.
The distributions of many variables can be represented by the symmetric normal
curve.
The Z-distribution is a specific normal distribution with a mean μ = 0 , and a
variance σ 2 = 1 .
An individual observation within a population (or sample) is represented by the
variable X.
The Sampling Distribution
A population consists of all observations with a particular set of characteristics.
A parameter is a numerical characteristic of the population.
A sample statistic is a numerical characteristic of the sample.
All sample statistics are considered variables since different samples will generate
different values of the statistic.
Sample statistics are used as estimates of population parameters.
If the sample is an unbiased drawing from the population, the sample mean, X̄, is
an unbiased estimate of the population mean, μ.
If the sample is an unbiased drawing from the population, the sample variance, S2,
is an unbiased estimate of the population variance, σ 2 .
If the sample is an unbiased drawing from the population, the sample standard
deviation, S, is used as an estimate of the population standard deviation, σ.
Typically there will be some difference between the population parameter, μ, and
the sample statistic X̄ even if the statistic is calculated from an unbiased sample.
This difference is called Sampling Error.
The variance (standard deviation) of the variable X will always be greater than the
variance (standard deviation) of the sample mean, X̄.
 The variance of X is denoted σ².
 The variance of X̄ is σ²/n.
An unbiased drawing of either the variable X or X̄ will be an estimate of the
population parameter, μ. The researcher would rather estimate μ with X̄ because
the variance of the distribution for X̄ is smaller than the variance for X.
The distribution of the variable X̄ can be represented by the normal curve (with a
specific mean and variance) if the size of the sample taken to create X̄ is at least
30 observations. This holds regardless of the distribution of the population that is
sampled.
Confidence Intervals



A confidence interval for μ is an interval generated around the sample statistic X̄
in which there is a 100(1-α)% probability that the value of μ is within the
interval.
A confidence interval for the parameter μ provides information as to its possible
value that goes beyond the information provided by the point estimate X̄.
The size of the interval will vary with α.
Hypothesis Tests
A hypothesis test involves testing an idea the researcher has about the value of a
population parameter.
A hypothesis test always involves two competing ideas as to the value of the
population parameter. One hypothesis is placed within the null, H 0,the other is
placed within the alternative, H1.
Going into the hypothesis test, the assumed value of the population parameter is
that which is placed within H0.
The hypothesis placed within H0 is considered in an advantaged position since,
normally, only if sample evidence strongly suggests H0 is incorrect will the
researcher reject H0.
In the formal test, if the test statistic falls within a predetermined range of values (the critical or rejection region) the null hypothesis is rejected.
The alternative hypothesis determines whether the test will be a one tailed or two
tailed test. It will determine on which side the rejection region will be for the one
tailed test.
A type I error occurs if the researcher incorrectly rejects the null hypothesis. A
type II error occurs when the researcher incorrectly fails to reject H0.
The researcher controls the probability of making a type I error by setting the
level of significance of the test.
Prob-value of a Hypothesis Test
A prob-value is a measure of the test statistic's proximity to the value of the
population parameter stated within H0.
A prob-value is the probability of obtaining a sample statistic that is at least as
distant from the hypothesized population parameter as the test statistic was from
the parameter.
In general, the closer the test statistic is to the hypothesized population parameter,
the larger the calculated prob-value.
The larger the prob-value, the weaker the sample evidence against the null
hypothesis, H0, and the less justification the researcher has for rejecting it.
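A Python sketch of a two-tailed prob-value for a Z test of H0: µ = 50; the sample summary numbers below are assumed.

```python
from statistics import NormalDist

# Assumed: hypothesized mean, sample size, sample mean, known σ.
mu0, n, x_bar, sigma = 50, 36, 52.0, 6.0

z = (x_bar - mu0) / (sigma / n ** 0.5)        # test statistic, here 2.0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed prob-value

# A small prob-value means the sample mean is far from µ0 in standard
# errors, i.e. strong evidence against H0.
reject_at_5pct = p_value < 0.05
```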
t-distribution and Hypothesis Test
In a hypothesis test for μ, if the sample comprises fewer than 30 observations and
the population variance must be estimated, the distribution of the test statistic is
more closely approximated by the t-distribution. An additional condition for using
the t under these circumstances is that the population which is sampled must be
normally distributed.
The t-distribution is a collection of symmetric curves generated by a formula. The
distributions all have a variance greater than one.
The specific t-distribution chosen in performing a hypothesis test will depend on
the degrees of freedom of the test. In the hypothesis test for μ , the degrees of
freedom is calculated as n-1, where n is the sample size.
The critical value of the test utilizing the t-distribution depends on both the level
of significance of the test and the degrees of freedom.
Proportions and Hypothesis Tests
A proportion is the percentage of observations in a population (sample) that hold
a specific characteristic. It is a different way of obtaining summary information
about a population (sample).
The population proportion is denoted P. The sample proportion is denoted p̂ .
If the sample is an unbiased drawing from the population, the expected value of
the sample proportion, p̂ , is the population proportion P.
The variance of p̂ is Pq/n, where q is (1-P) and n is the sample size.
In the test of the difference in proportions, the equality of population proportions
is always assumed within the null hypothesis.
In performing a hypothesis test on the difference of two proportions, the test
statistic is pˆ 1 - pˆ 2 .
In the difference of two proportions test, the test statistic pˆ 1 - pˆ 2 is a variable since
each of the individual sample proportions is a variable.
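A Python sketch of the two-proportion test; the counts below are made up, and under H0 the two samples are pooled to estimate the common proportion.

```python
from statistics import NormalDist

# Made-up counts: successes and sample sizes for the two samples.
x1, n1 = 45, 100
x2, n2 = 30, 100

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)       # pooled estimate of P under H0: P1 = P2
q_pool = 1 - p_pool
se = (p_pool * q_pool * (1 / n1 + 1 / n2)) ** 0.5

z = (p1 - p2) / se                   # standardized test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed prob-value
```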
Introduction to Regression
A simple regression estimates a relationship between the dependent variable, Y,
and one independent variable, X. Within a multiple regression a relationship is
estimated in which there is more than one independent variable.
The relationship between X and Y can be either stochastic or deterministic.
o In a stochastic relationship, a whole distribution of Y exists for each value
of X.
o In a deterministic relationship, there is just one value of Y for every X
value.
The distribution of Y for a given value of X is termed a subpopulation of Y.
Virtually all relationships between dependent and independent variables in the
social sciences are stochastic.
A relationship between X and Y can be positive or inverse.
o For a positive relationship, an increase (decrease) in the independent
variable X will cause the variable Y to increase (decrease).
o For an inverse relationship, changes in X will induce the variable Y to
move in the opposite direction.
For a stochastic relationship between X and Y, the independent variable is
modelled to determine the expected value of Y.
In the case of the simple linear regression, the population regression equation is
represented by: E(Y|X = xᵢ) = β₀ + β₁xᵢ
o The population regression equation calculates the expected (or mean)
value of Y that is associated with a specific value of X within the
population.
o β₀ is the intercept of the regression line and β₁ is the slope of the line.
o The parameters of the population equation, β₀ and β₁, can be calculated
only if all of the data within the specified population were utilized.
Otherwise the parameters would be estimated using sample data.
The above population regression equation is a correct equation only if the
relationship between Y and X is linear within the population and if the
dependent variable, Y, actually is a function of only one (independent)
variable.
The sample regression equation is calculated from sample data. The equation is
used as an estimate of the population regression equation.
The sample regression equation for the simple model is ŷ = b₀ + b₁xᵢ. ŷ is an
estimate for E(Y|X = xᵢ); b₀ is an estimate for β₀ and b₁ is an estimate for β₁.
b₁ = (Σᵢ xᵢyᵢ − n·x̄·ȳ) / (Σᵢ xᵢ² − n·x̄²)
b₀ = ȳ − b₁·x̄
The difference within the population between an individual observation of the
dependent variable, Yᵢ, and its conditional mean is called the error term:
eᵢ = Yᵢ − E(Y|X = xᵢ)

The corresponding difference within the sample between an individual
observation of the dependent variable, yᵢ, and its predictor is called the residual
term: êᵢ = yᵢ − ŷᵢ
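These least-squares formulas can be sketched in Python on a made-up sample:

```python
import statistics

# Made-up sample of (x, y) observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)

# Slope and intercept from the least-squares formulas above.
b1 = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / \
     (sum(x * x for x in xs) - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * x for x in xs]                  # predictions ŷ
residuals = [y - yh for y, yh in zip(ys, y_hat)]   # ê, sums to ~0
```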
Standard Error of the Regression Model

The formula to calculate the slope (b1) and intercept (b0) of the sample regression
line is one which minimizes the sum of squared errors (SSE).
SSE = Σᵢ êᵢ² = Σᵢ (yᵢ − ŷᵢ)²
Residual terms, êᵢ, sum to zero: Σᵢ êᵢ = 0. This implies that the average
difference between yᵢ and its predictor, ŷᵢ, is zero.



The variance of each subpopulation of Y equals σe².
σe² can be estimated using the sample data by Se² = SSE / (n − k − 1), where k is
the number of independent variables in the regression. In the simple regression, k
equals one.
Se is an estimate of σe. Se is termed the standard error of the
regression. (Se = √Se²)
R-square


R2 measures the proportion of variation in the dependent variable that is explained
by the model.
R² = SSR/SST = 1 − SSE/SST
SST is called the total sum of squares. It is the total variation in the dependent
variable. It is the variation for which the model is attempting to
account. SST = Σᵢ (yᵢ − ȳ)²

SSR is termed the regression sum of squares. It is the variation in y that is
generated by the model. It is the variation in y that the linear model indicates is
caused by the independent variable(s). SSR = Σᵢ (ŷᵢ − ȳ)²


SSE, the sum of squared errors, is the variation in y that is not accounted for by
the model.
R2, a proportion, must equal some value between zero and one.
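A Python sketch of SST, SSR, SSE, R², and the standard error on a made-up sample, with the slope and intercept assumed already fitted:

```python
import statistics

# Made-up sample and an assumed fitted line ŷ = b0 + b1·x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = 0.14, 1.96

y_bar = statistics.mean(ys)
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by model
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained

r2 = ssr / sst                          # equals 1 - SSE/SST
k = 1                                   # one independent variable
se = (sse / (len(xs) - k - 1)) ** 0.5   # standard error of the regression
```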
Hypothesis Testing for Regression Parameters


In the simple linear regression, b1 and b0 are variables. Their values depend on the
specific sample that is taken. The expected value of b1 is β1. The expected value of
b0 is β0.
Each of the variables follows a sampling distribution.

The true standard deviations of the variables b₀ and b₁ are respectively denoted
σ_b₀ and σ_b₁.
σ_b₀ is estimated by S_b₀ and σ_b₁ by S_b₁, both of which are calculated using
sample data.
The standard deviations of b0 and b1 are critical in performing hypothesis tests on
the respective population parameters β0 and β1.
Hypothesis tests on the population parameter β₁ carry a special significance
because β₁ is the relationship between the respective X variable and Y within the
population. The researcher normally theorizes a relationship between X and Y. A
regression is a way of testing the theory.
Typically the most important hypothesis test to perform on a slope parameter, β₁,
is one in which the null is H0: β₁ = 0. The failure to reject this H0 suggests no
relationship exists between the independent and dependent variable within the
population (at a given level of significance).
The t-distribution is always used in performing hypothesis tests for individual β
parameters.
Dummy Variables


A dummy variable is a qualitative variable. The variable accounts for the
existence of a characteristic. It does not measure the quantity of a characteristic.
Dummy variables usually take on the values 0 or 1 to account for the existence/non-existence of a specific characteristic (for example pass/fail).