Course Notes - Miles Finney
Course Notes

Introduction to Statistics

Statistics is a discipline defining a set of procedures used to collect and interpret numerical data. The discipline of statistics serves two purposes:
o Statistical procedures can be used to describe the relevant characteristics (dispersion, central tendency) of a body of data. This is called Descriptive Statistics.
o Statistical procedures can be used to help us make inferences or predictions about a whole population based on information from a sample of the population. This is called Inferential Statistics.

Measurement and Sampling

A population consists of all the observations with a given set of characteristics. It is all the possible observations within the group the researcher is studying. A sample is a portion of the population. In inferential statistics, a sample is selected to represent the population studied.

A sample will, on average, be representative of the population if the procedure used to select the sample is unbiased. An unbiased sampling procedure is one in which each observation in the population has an equal chance of being chosen for the sample. Whether a sampling procedure is biased depends partly on the exact definition of the population.
o For example, if a researcher were studying CSULA male students, it would not be biased to select a sample from among only males.
o If the researcher were studying CSULA students in general, then it would be biased to select only males.

Random sampling is an unbiased procedure, but there are other unbiased sampling procedures. Stratified sampling, in which the sample is purposely selected so that certain characteristics of the sample match those of the population, is not completely random but can nevertheless be unbiased.

Distribution and the Visual Display of Data

A frequency distribution shows the number of observations in a data set that fall into the various classes of the data. A category (or bin) is an interval of the data.
o Categories must be mutually exclusive: no observation in the data should fall within more than one category.
o Categories must also be exhaustive: each observation in the data must fall within a class.

The number of observations falling into the various classes of the distribution is called the frequency (denoted $f_i$). A relative frequency distribution gives the percentage of observations that fall into the various classes of the data. A histogram is a graphical representation of a frequency or relative frequency distribution.

Summary Description of Data

The summary information we usually want to know about a data set is the center of the data (a measure of centrality) and how dispersed the data set is (a measure of dispersion). A parameter is a numerical characteristic of the population. A sample statistic is a numerical characteristic of the sample.

There are three measures of centrality: mean, median and mode.
o The mean (arithmetic average) is the most common measure of centrality. The mean of the population is denoted $\mu$; the mean of the sample is denoted $\bar{X}$.
o The median is the middle value of the data ordered from lowest to highest.
o The mode is the value in a data set that occurs with the greatest frequency.
o The value of the mean is sensitive to outliers in a data set, whereas the median is not.
o The median will equal the mean if the distribution of the data is symmetric. A symmetric distribution is one in which the side of the distribution to the right of the mean is a mirror image of the left portion.

If a distribution is skewed to the right, outlier values in the data much larger than the mean pull the value of the mean above the median. If a distribution is skewed to the left, outlier values in the data much smaller than the mean pull the value of the mean below the median.
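A minimal sketch of the three measures of centrality and the mean's sensitivity to outliers, using Python's standard library and a made-up data set:

```python
from statistics import mean, median, mode

# Hypothetical sample, ordered lowest to highest; 40 is an outlier.
scores = [3, 3, 4, 5, 6, 7, 40]

print(mean(scores))    # 9.71...: pulled upward by the outlier (right skew)
print(median(scores))  # 5: the middle value, unaffected by the outlier
print(mode(scores))    # 3: the most frequent value
```

Note how the mean exceeds the median, exactly the pattern described above for a right-skewed distribution.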
There are three measures of dispersion: range, variance and standard deviation.
o Variance and standard deviation are both measures of how a data set varies with respect to its mean.
o The variance of population data is $\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}$.
o The variance of sample data is $S^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$.
o The standard deviation, for either the population or the sample, equals the square root of the variance.
o The larger the variance or standard deviation, the greater the variation in the data around the mean.

The coefficient of variation expresses the standard deviation as a percentage of the mean: $CV = \frac{S}{\bar{x}} \times 100\%$. The coefficient of variation is useful in comparing the variation of data sets that have different means.

The Normal Distribution

A continuous distribution is represented by a smooth curve. Probability is represented by the area under the curve. The normal distribution is the most common continuous distribution in statistics; many variables in the social and natural world are normally distributed. The normal distribution is a formula that draws a family of symmetric curves, and each distinct normal curve has its own mean and variance.

Characteristics of the normal distribution:
o The normal curve is symmetric around the mean, $\mu$.
o The normal curve extends from negative to positive infinity.
o The total area under the normal curve sums to one.
o The mean, median and mode of the distribution equal one another.
o Empirical Rule: if a variable follows a normal distribution, 68% of its observations will be within one standard deviation of its mean, 95% will be within two standard deviations, and 99.7% will be within three standard deviations.

The standard normal, or Z-distribution, is a specific normal curve with mean $\mu = 0$ and variance $\sigma^2 = 1$. The value of the variable Z represents standard deviations from the mean. If the variable X represents individual observations in a population, the formula $Z = \frac{X - \mu}{\sigma}$ transforms the variable into Z.

The Concept of Probability

A random experiment is any activity whose outcome cannot be predicted with certainty (for example, a coin toss). Each possible outcome of an experiment is called a basic outcome. An event is a collection of basic outcomes that share some characteristic.
o For example, if the experiment consisted of randomly selecting a student in class, each student would represent a basic outcome, whereas left-handed students would exemplify an event.

An event A composed of three basic outcomes is denoted $A = \{O_1, O_2, O_3\}$. The probability of event A occurring, P(A), can be assigned using different approaches.
o Relative Frequency Approach. The experiment may be repeated n times and $f_A$, the frequency of event A, observed. The relative frequency, $\frac{f_A}{n}$, can be used to approximate the probability.
o Equally Likely (or Theoretical) Approach. If each basic outcome is equally likely to occur, the probability of event A can be calculated as the sum of the chances of its basic outcomes.

The Law of Large Numbers states that if an experiment is repeated through many trials, the proportion of trials in which event A occurs will be close to the probability P(A). The larger the number of trials, the closer the proportion should be to P(A).
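A quick simulation of the Law of Large Numbers, sketched under the assumption of a fair coin, so P(heads) = 0.5:

```python
import random

def proportion_heads(trials: int) -> float:
    """Toss a fair coin `trials` times; return the observed proportion of heads."""
    heads = sum(random.random() < 0.5 for _ in range(trials))
    return heads / trials

# The observed proportion should drift toward P(heads) = 0.5 as trials grow.
for n in (10, 100, 10_000, 1_000_000):
    print(f"{n:>9} tosses: proportion of heads = {proportion_heads(n):.4f}")
```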
A union of two events is composed of all those basic outcomes that belong to at least one of the events. An intersection of two events is composed of those basic outcomes that fall in both events simultaneously. Conditional probability is the probability of an event occurring given that another event has already occurred.
o Suppose A = {females} and B = {right-handed people}.
o $A \cup B$, the union of the events, consists of all those who are right-handed, female or both.
o $A \cap B$, the intersection of the two events, consists of right-handed females.
o $P(A \mid B)$, the probability of event A conditional on event B, is the probability of being female conditional on being right-handed.
o In an experiment that selects students, $P(A \mid B)$ would be the probability of selecting a female if we chose only from right-handed students.

The formula for the probability of the union of events A and B: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

The formula for conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$.

If events A and B are independent, then the likelihood of one event occurring is not a function of the other event. If event A is independent of B, then $P(A \mid B) = P(A)$.
o In the example, under independence, our chance of selecting a female is not altered if we condition our selection on those who are right-handed.

If the two events are independent, the probability of the intersection of events A and B is calculated as $P(A \cap B) = P(A)P(B)$.

Discrete Probability Distribution

A discrete probability distribution assigns probability to the possible values of the discrete variable X. The probabilities that make up any probability distribution must sum to 1 (100%). The notation expressing the probability that the variable X equals a specific value $x_i$ is $P(X = x_i)$.

The variable X is considered discrete if we are able to observe and count each of the different values of the variable.
o For example, student attendance in a statistics class over the course of a quarter would represent a discrete variable: the different values of the variable are observable and countable.

The expected value, or mean, of the variable X can be calculated using the information provided by its distribution. The formula for the expected value of the discrete variable X is $E(X) = \mu_x = \sum_i x_i P(X = x_i)$. This calculation differs from the simple formula for the average: it uses the probabilities given by the distribution rather than the individual values of X directly.

The formula for the variance of the discrete variable X is $\sigma^2 = \sum_i (x_i - \mu)^2 P(X = x_i)$. The standard deviation σ is the square root of the variance. Variance and standard deviation are indexes measuring the degree of variation in X.
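A small sketch of the expected-value and variance formulas, using a hypothetical four-point distribution:

```python
# Hypothetical discrete distribution: P(X = x) for each possible value x.
dist = {0: 0.10, 1: 0.20, 2: 0.40, 3: 0.30}  # probabilities sum to 1

# E(X) = sum over i of x_i * P(X = x_i)
mu = sum(x * p for x, p in dist.items())

# Var(X) = sum over i of (x_i - mu)^2 * P(X = x_i)
var = sum((x - mu) ** 2 * p for x, p in dist.items())

print(f"E(X) = {mu:.2f}, Var(X) = {var:.2f}, SD = {var ** 0.5:.2f}")
```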
From Samples to Population

Parameters are numerical characteristics of the population. The numerical characteristics of a sample are called sample statistics.
o The mean, variance and standard deviation of the population are denoted, respectively, $\mu$, $\sigma^2$ and $\sigma$.
o The mean, variance and standard deviation of the sample are denoted, respectively, $\bar{X}$, $S^2$ and S.

Sample statistics serve as estimates of the population parameters. If the sample is drawn from the population in an unbiased manner, the sample mean, $\bar{X}$, is an unbiased estimate of the population mean, $\mu$. Unbiasedness means that, on average, the statistic $\bar{X}$ will equal the population parameter $\mu$. In most cases there will be some difference between the sample statistic and the population parameter. This difference is called Sampling Error.

Characteristics of $\bar{X}$:
o $\bar{X}$ is a variable. It is calculated from a sample, and different samples taken from a population will typically generate different values of $\bar{X}$.
o $\bar{X}$ follows a sampling distribution, which assigns probability to the different possible values of the variable.
o The expected value of the sample mean is the population mean: $E(\bar{X}) = \mu$.
o The variance of $\bar{X}$ is $\frac{\sigma^2}{n}$ and the standard deviation is $\frac{\sigma}{\sqrt{n}}$, where n is the size of the sample taken to calculate $\bar{X}$.
o The variance and standard deviation of $\bar{X}$ can be estimated from sample data, in which case the calculated variance and standard deviation would be $\frac{S^2}{n}$ and $\frac{S}{\sqrt{n}}$.
o The variance and standard deviation of $\bar{X}$ decrease as sample size increases. This is seen from the above formulas, in which n is in the denominator.

The distribution of $\bar{X}$ can be approximated by the normal curve (with a specific mean and variance) if the sample size is at least thirty observations. This holds regardless of the distribution of the population that is sampled.

A population proportion, denoted P, is the percentage of observations within a population that have a specific characteristic. A sample proportion, denoted $\hat{p}$, is the percentage of observations within a sample that have a specific characteristic. $\hat{p}$, calculated from a sample, is a variable with the following characteristics:
o The expected value of the sample proportion is the population proportion: $E(\hat{p}) = P$.
o The variance of $\hat{p}$ is $\frac{Pq}{n}$ and the standard deviation is $\sqrt{\frac{Pq}{n}}$, where n is the sample size and q = 1 - P.
o If nP ≥ 5 and nq ≥ 5, the distribution of $\hat{p}$ can be approximated by the normal distribution with the relevant expected value and variance.

Interval Estimation of Population Mean and Proportion

$\bar{X}$ is a point estimate for the population parameter $\mu$. The point estimate for $\mu$ does not utilize the information provided by the standard deviation of $\bar{X}$ on the possible magnitude of sampling error. A confidence interval for $\mu$ utilizes this information by generating an interval around $\bar{X}$ in which there is a 100(1-α)% probability that $\mu$ lies. 1-α is the level of confidence of the interval (α is the level of significance), which is set by the researcher. It gives the probability that the population parameter will fall within the constructed interval.

The formula that generates the confidence interval for $\mu$ around our sample point estimate is $\bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$, where $Z_{\alpha/2}$ is the value in the standard normal distribution for which $P(Z > Z_{\alpha/2}) = \frac{\alpha}{2}$, and $\frac{\sigma}{\sqrt{n}}$ is the standard deviation of $\bar{X}$.
o The use of σ in the formula implies that we know the value of the population variance.
o We also assume that the population that is sampled is normally distributed.
o If these requirements do not hold, we can still use the Z-distribution to estimate the confidence interval at a given level of significance if our sample size is at least thirty observations. In this case we would estimate σ with the sample standard deviation, S.

There is a tradeoff between interval size and level of confidence.
o The greater the level of confidence (1-α), the larger the interval calculated around $\bar{X}$.
o The researcher wants the precision of a small interval together with a large level of confidence.

$\hat{p}$ is a point estimate for the population parameter P, the population proportion. The formula that generates the confidence interval for P is $\hat{p} \pm Z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$, where $\hat{q} = 1 - \hat{p}$.
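A minimal sketch of the Z-based interval for µ, assuming a large sample so that S can stand in for σ (the summary numbers are made up):

```python
import math

# Hypothetical summary statistics from a sample of n = 64 observations.
n, xbar, s = 64, 72.5, 12.0

z = 1.96                            # Z_{alpha/2} for a 95% interval (alpha = 0.05)
half_width = z * s / math.sqrt(n)   # Z_{alpha/2} * S / sqrt(n)

print(f"95% CI for mu: ({xbar - half_width:.2f}, {xbar + half_width:.2f})")
```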
t-Distribution

If the population is normal but the variance is unknown and the sample size is less than thirty, confidence intervals may be constructed using the t-distribution. Characteristics of the t-distribution:
o As in the case of the normal distribution, the t-distribution is a formula that draws a family of symmetric curves.
o The mean of t is 0.
o The variance of the t-distribution is always greater than one.
o The variance of t is calculated as $\frac{\nu}{\nu - 2}$, where $\nu$ equals n-1 and is called the degrees of freedom.

The formula to calculate a confidence interval for $\mu$ at the 1-α level of confidence is $\bar{X} \pm t_{\alpha/2,\nu} \frac{S}{\sqrt{n}}$, where $t_{\alpha/2,\nu}$ is the t-value for which $P(t > t_{\alpha/2,\nu}) = \frac{\alpha}{2}$. S is an estimate for σ.

Hypothesis Testing

A hypothesis test involves testing an idea the researcher has about the value of a population parameter. A hypothesis test always involves two competing ideas as to the value of the population parameter. One hypothesis is placed within the null, H0; the other is placed within the alternative, H1. Going into the hypothesis test, the assumed value of the population parameter is that which is placed within H0. The hypothesis placed within H0 is considered to be in an advantaged position since, normally, the researcher will reject H0 only if sample evidence strongly suggests H0 is incorrect.

In the formal test, if the test statistic falls within a predetermined range of values (the critical or rejection region), the null hypothesis is rejected. The alternative hypothesis determines whether the test will be one-tailed or two-tailed, and for a one-tailed test it determines on which side the rejection region lies.

A Type I error occurs if the researcher incorrectly rejects the null hypothesis. A Type II error occurs when the researcher incorrectly fails to reject H0. The researcher controls the probability of making a Type I error by setting the level of significance of the test.

Course Notes

The Normal Distribution

A distribution assigns probabilities to possible values of a variable. The area under the curve utilized as a distribution represents probability, and the total area under any curve that serves as a distribution must sum to one. The normal distribution is actually a collection of curves all derived from the same formula. The distributions of many variables can be represented by the symmetric normal curve. The Z-distribution is a specific normal distribution with a mean µ = 0 and a variance σ² = 1. An individual observation within a population (or sample) is represented by the variable X.

The Sampling Distribution

A population consists of all observations with a particular set of characteristics. A parameter is a numerical characteristic of the population; a sample statistic is a numerical characteristic of the sample. All sample statistics are considered variables, since different samples will generate different values of the statistic. Sample statistics are used as estimates of population parameters.

If the sample is an unbiased drawing from the population:
o the sample mean, $\bar{X}$, is an unbiased estimate of the population mean, $\mu$;
o the sample variance, $S^2$, is an unbiased estimate of the population variance, $\sigma^2$;
o the sample standard deviation, S, is an unbiased estimate of the population standard deviation, σ.
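A simulation sketch of these claims about $\bar{X}$, assuming a made-up population (the integers 1 through 100, so µ and σ² are known exactly):

```python
import random
import statistics

random.seed(1)
population = list(range(1, 101))   # made-up population: mu = 50.5
n, draws = 25, 10_000

# Draw many samples of size n (with replacement) and record each sample mean.
means = [statistics.mean(random.choices(population, k=n)) for _ in range(draws)]

print(statistics.mean(means))                # close to mu = 50.5: E(X-bar) = mu
print(statistics.variance(means))            # close to sigma^2 / n
print(statistics.pvariance(population) / n)  # sigma^2 / n, for comparison
```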
Typically there will be some difference between the population parameter, $\mu$, and the sample statistic, $\bar{X}$, even if the statistic is calculated from an unbiased sample. This difference is called Sampling Error.

The variance (standard deviation) of the variable X will always be greater than the variance (standard deviation) of the sample mean, $\bar{X}$. The variance of X is denoted $\sigma^2$; the variance of $\bar{X}$ is $\frac{\sigma^2}{n}$. An unbiased drawing of either variable X or $\bar{X}$ will be an estimate of the population parameter $\mu$. The researcher would rather estimate $\mu$ with $\bar{X}$ because the variance of the distribution for $\bar{X}$ is smaller than the variance for X.

The distribution of the variable $\bar{X}$ can be represented by the normal curve (with a specific mean and variance) if the size of the sample taken to create $\bar{X}$ is at least 30 observations. This holds regardless of the distribution of the population that is sampled.

Confidence Intervals

A confidence interval for $\mu$ is an interval generated around the sample statistic $\bar{X}$ in which there is a 100(1-α)% probability that the one value of $\mu$ is within the interval. A confidence interval for the parameter $\mu$ provides information as to its possible value that goes beyond the information provided by the point estimate $\bar{X}$. The size of the interval will vary with α.

Hypothesis Tests

A hypothesis test involves testing an idea the researcher has about the value of a population parameter. A hypothesis test always involves two competing ideas as to the value of the population parameter. One hypothesis is placed within the null, H0; the other is placed within the alternative, H1. Going into the hypothesis test, the assumed value of the population parameter is that which is placed within H0. The hypothesis placed within H0 is considered to be in an advantaged position since, normally, the researcher will reject H0 only if sample evidence strongly suggests H0 is incorrect.

In the formal test, if the test statistic falls within a predetermined range of values (the critical or rejection region), the null hypothesis is rejected. The alternative hypothesis determines whether the test will be one-tailed or two-tailed, and for a one-tailed test it determines on which side the rejection region lies.

A Type I error occurs if the researcher incorrectly rejects the null hypothesis. A Type II error occurs when the researcher incorrectly fails to reject H0. The researcher controls the probability of making a Type I error by setting the level of significance of the test.

Prob-value of a Hypothesis Test

A prob-value is a measure of the test statistic's proximity to the value of the population parameter stated within H0. A prob-value is the probability of obtaining a sample statistic that is at least as distant from the hypothesized population parameter as the test statistic was from the parameter. In general, the closer the test statistic is to the hypothesized population parameter, the larger the calculated prob-value. The larger the prob-value, the weaker the sample evidence against the null hypothesis, H0.

t-distribution and Hypothesis Test

In a hypothesis test for $\mu$, if the sample comprises fewer than 30 observations and the population variance must be estimated, the test statistic is more closely approximated by the t-distribution. An additional condition for using the t under these circumstances is that the population which is sampled must be normally distributed.

The t-distribution is a collection of symmetric curves generated by a formula. The distributions all have a variance greater than one. The specific t-distribution chosen in performing a hypothesis test will depend on the degrees of freedom of the test. In the hypothesis test for $\mu$, the degrees of freedom are calculated as n-1, where n is the sample size. The critical value of the test utilizing the t-distribution depends on both the level of significance of the test and the degrees of freedom.
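A sketch of a small-sample t test for µ under these conditions, with made-up data; the critical value is hardcoded from a t table rather than computed:

```python
import math
import statistics

# Hypothetical small sample (n = 10) from a population assumed normal.
data = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]
mu0 = 5.0                     # H0: mu = 5.0 (two-tailed test, alpha = 0.05)

n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)    # sample standard deviation S (n - 1 divisor)

t_stat = (xbar - mu0) / (s / math.sqrt(n))  # t = (X-bar - mu0) / (S / sqrt(n))
t_crit = 2.262                # t_{0.025, 9} from a t table (df = n - 1 = 9)

print(f"t = {t_stat:.3f}; reject H0? {abs(t_stat) > t_crit}")
```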
Proportions and Hypothesis Tests

A proportion is the percentage of observations in a population (or sample) that hold a specific characteristic. It is a different way of obtaining summary information about a population (or sample). The population proportion is denoted P; the sample proportion is denoted $\hat{p}$.

If the sample is an unbiased drawing from the population, the expected value of the sample proportion, $\hat{p}$, is the population proportion P. The variance of $\hat{p}$ is $\frac{Pq}{n}$, where q = 1 - P and n is the sample size.

In the test of the difference in proportions, the equality of the population proportions is always assumed within the null hypothesis. In performing a hypothesis test on the difference of two proportions, the test statistic is built on $\hat{p}_1 - \hat{p}_2$. In this test, $\hat{p}_1 - \hat{p}_2$ is a variable, since each of the individual sample proportions is a variable.
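A sketch of the two-proportion test using the standard pooled estimate, which is valid because equality of the proportions is assumed under H0 (the counts are hypothetical):

```python
import math

# Hypothetical samples: x successes out of n observations in each group.
x1, n1 = 45, 100
x2, n2 = 30, 100

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled proportion, justified by H0: P1 = P2

# Standard error of p1_hat - p2_hat using the pooled proportion.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

print(f"z = {z:.3f}; reject H0 at alpha = 0.05? {abs(z) > 1.96}")
```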
Introduction to Regression

A simple regression estimates a relationship between the dependent variable, Y, and one independent variable, X. Within a multiple regression, a relationship is estimated in which there is more than one independent variable.

The relationship between X and Y can be either stochastic or deterministic.
o In a stochastic relationship, a whole distribution of Y exists for each value of X.
o In a deterministic relationship, there is just one value of Y for every X value.

The distribution of Y for a given value of X is termed a subpopulation of Y. Virtually all relationships between dependent and independent variables in the social sciences are stochastic.

A relationship between X and Y can be positive or inverse.
o In a positive relationship, an increase (decrease) in the independent variable X will cause the variable Y to increase (decrease).
o In an inverse relationship, changes in X will induce the variable Y to move in the opposite direction.

For a stochastic relationship between X and Y, the independent variable is modelled to determine the expected value of Y. In the case of the simple linear regression, the population regression equation is represented by $E(Y \mid X = x_i) = \beta_0 + \beta_1 x_i$.
o The population regression equation calculates the expected (or mean) value of Y that is associated with a specific value of X within the population.
o $\beta_0$ is the intercept of the regression line and $\beta_1$ is the slope of the line.
o The parameters of the population equation, $\beta_0$ and $\beta_1$, can be calculated only if all of the data within the specified population were utilized. Otherwise the parameters are estimated using sample data.

The above population regression equation is correct only if the relationship between Y and X is linear within the population and if the dependent variable, Y, actually is a function of only one (independent) variable.

The sample regression equation is calculated from sample data and is used as an estimate of the population regression equation. The sample regression equation for the simple model is $\hat{y} = b_0 + b_1 x_i$. $\hat{y}$ is an estimate for $E(Y \mid X = x_i)$, $b_0$ is an estimate for $\beta_0$, and $b_1$ is an estimate for $\beta_1$. The estimates are calculated as

$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$, $b_0 = \bar{y} - b_1 \bar{x}$.

The difference within the population between an individual observation of the dependent variable, $Y_i$, and its conditional mean is called the error term: $e_i = Y_i - E(Y \mid X = x_i)$. The corresponding difference within the sample between an individual observation of the dependent variable, $y_i$, and its predictor is called the residual: $\hat{e}_i = y_i - \hat{y}_i$.

Standard Error of the Regression Model

The formulas for the slope ($b_1$) and intercept ($b_0$) of the sample regression line are those which minimize the sum of squared errors (SSE):

$SSE = \sum_{i=1}^{n} \hat{e}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The residuals, $\hat{e}_i$, sum to zero: $\sum_{i=1}^{n} \hat{e}_i = 0$. This implies that the average difference between $y_i$ and its predictor, $\hat{y}_i$, is zero.

The variance of each subpopulation of Y equals $\sigma_e^2$. $\sigma_e^2$ can be estimated using the sample data by $S_e^2 = \frac{SSE}{n - k - 1}$, where k is the number of independent variables in the regression. In the simple regression, k equals one. $S_e$ ($= \sqrt{S_e^2}$) is an estimate of $\sigma_e$ and is termed the standard error of the regression.

R-square

$R^2$ measures the proportion of variation in the dependent variable that is explained by the model:

$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

SST is called the total sum of squares. It is the total variation in the dependent variable, the variation for which the model is attempting to account: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$.

SSR is termed the regression sum of squares. It is the variation in y that is generated by the model, the variation in y that the linear model indicates is caused by the independent variable(s): $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$.

SSE, the sum of squared errors, is the variation in y that is not accounted for by the model. $R^2$, being a proportion, must equal some value between zero and one.

Hypothesis Testing for Regression Parameters

In the simple linear regression, $b_1$ and $b_0$ are variables; their values depend on the specific sample that is taken. The expected value of $b_1$ is $\beta_1$, and the expected value of $b_0$ is $\beta_0$. Each of the variables follows a sampling distribution. The true standard deviations of the variables $b_0$ and $b_1$ are denoted, respectively, $\sigma_{b_0}$ and $\sigma_{b_1}$. $\sigma_{b_0}$ is estimated by $S_{b_0}$ and $\sigma_{b_1}$ by $S_{b_1}$, both calculated using sample data. The standard deviations of $b_0$ and $b_1$ are critical in performing hypothesis tests on the respective population parameters $\beta_0$ and $\beta_1$.

Hypothesis tests on the population parameter $\beta_1$ carry a special significance because $\beta_1$ is the relationship between the respective X variable and Y within the population. The researcher normally theorizes a relationship between X and Y; a regression is a way of testing the theory. Typically the most important hypothesis test to perform on a slope parameter, $\beta_1$, is one in which the null is H0: $\beta_1 = 0$. Failure to reject this H0 suggests that no relationship exists between the independent and dependent variables within the population (at a given level of significance). The t-distribution is always used in performing hypothesis tests for individual β parameters.

Dummy Variables

A dummy variable is a qualitative variable. The variable accounts for the existence of a characteristic; it does not measure the quantity of a characteristic. Dummy variables usually take on the values 0 or 1 to account for the existence or nonexistence of a specific characteristic (for example, pass/fail).
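A self-contained sketch pulling these formulas together: it fits the simple regression on a made-up five-point sample, then reports $b_1$, $b_0$, the standard error of the regression, and $R^2$:

```python
import math

# Hypothetical sample of (x, y) pairs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# b1 = (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2);  b0 = ybar - b1*xbar
b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
     (sum(x * x for x in xs) - n * xbar ** 2)
b0 = ybar - b1 * xbar

y_hat = [b0 + b1 * x for x in xs]                      # fitted values
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # unexplained variation
sst = sum((y - ybar) ** 2 for y in ys)                 # total variation

se = math.sqrt(sse / (n - 1 - 1))   # S_e, with k = 1 independent variable
r2 = 1 - sse / sst                  # proportion of variation explained

print(f"y-hat = {b0:.3f} + {b1:.3f} x;  S_e = {se:.3f};  R^2 = {r2:.3f}")
```

With an estimate $S_{b_1}$ of the slope's standard deviation in hand, the test of H0: $\beta_1 = 0$ described above would use the t statistic $b_1 / S_{b_1}$ with n - 2 degrees of freedom.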