OBJECTIVE, DATA, DESCRIPTIVE STATISTICS

OBJECTIVE

The first step in an empirical study is to clearly state the objective. The objective is to use a sample of data to draw conclusions about an economic relationship in a population. The population is the entire group of units in which you are interested. The sample is a subset of units from the population. The unit can be an individual, household, firm, city, county, state, nation, etc.

DATA

The next step is to obtain a sample of data. The process that generates the data can be a controlled or an uncontrolled experiment. The data can be experimental or observational.

Random Sample or Nonrandom Sample?

Once we have our sample, we must ask: is this a random sample or a nonrandom sample? A random sample tends to be representative of the population. A nonrandom sample may overrepresent or underrepresent different groups in the population. Sample selection results in a nonrandom sample.

Consequences of a Nonrandom Sample

A nonrandom sample can cause bias. Bias is a systematic error in a conclusion about a population.

Sources of Sample Selection

There are three important sources of sample selection.

1) Nonrandom method of sampling.
2) Nonresponse.
3) Response error.

A nonrandom method of sampling selects units from the population into the sample by a systematic choice process, not by chance. Nonresponse occurs when units selected into the sample choose not to respond or cannot be contacted. Response error occurs when units in the sample give incorrect information, or the information is incorrectly coded.

DESCRIPTIVE STATISTICS: GET TO KNOW THE DATA

The next step in an empirical study is to get to know the data. To get to know the data, we must do the following.

1) Get to know the variables.
2) Get to know the observations.
3) Organize, summarize, and describe the data.

This is called the descriptive study of data.
The descriptive study of data involves describing the characteristics of the sample data, not drawing conclusions about the population.

GET TO KNOW THE VARIABLES

A variable is a quantifiable characteristic of a unit. It is a known number that can differ from unit to unit.

Quantitative Variable

A variable that can be measured numerically on a well-defined scale.

Qualitative Variable

A variable that indicates the presence or absence of a quality or characteristic of a unit. It has as many categories as there are possible characteristics.

Quantifying a Qualitative Variable

To quantify a qualitative variable, one or more artificial variables are created. These artificial variables are called dummy variables. A dummy variable can take two values: 0 or 1. It takes the value 1 if a characteristic is present and 0 if it is absent.

Discrete Variable

A variable that can take a finite number of values on a given interval.

Continuous Variable

A variable that can take an infinite number of values on a given interval.

Variable Measurement

Variable measurement refers to how a variable is defined and the unit in which it is measured.

GET TO KNOW THE OBSERVATIONS

There are three possible types of observations on a unit. A univariate observation is a numerical value for a single variable. A bivariate observation is two numerical values for two variables. A multivariate observation is three or more numerical values for three or more variables. The number of observations always equals the number of units in the sample.

Let X, Y, and Z be variables. Let $x_t$, $y_t$, $z_t$ be the values of the variables X, Y, Z for the t-th unit in the sample. Then $(x_t)$ is a univariate observation; $(x_t, y_t)$ is a bivariate observation; $(x_t, y_t, z_t)$ is a multivariate observation.

ORGANIZE, SUMMARIZE, DESCRIBE THE DATA

To organize, summarize, and describe the data we do two things.

1) Look for patterns in the data.
2) Calculate descriptive statistics.
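The dummy-variable coding described earlier in this section can be illustrated with a small sketch. The helper name `make_dummies` and the `region` data are hypothetical, chosen only for illustration; the sketch creates one 0/1 variable per category of a qualitative variable.

```python
def make_dummies(values):
    """Return one 0/1 dummy variable (a list) per category found in `values`."""
    categories = sorted(set(values))
    # Each dummy takes the value 1 if the unit has the characteristic, 0 otherwise.
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical qualitative variable: region of residence for five households.
region = ["north", "south", "south", "west", "north"]
dummies = make_dummies(region)

for name, column in dummies.items():
    print(name, column)
```

Each unit has a 1 in exactly one of the dummy variables, because each unit belongs to exactly one category.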
Look for Patterns in the Data

To look for patterns in the data, we use two statistical tools.

1) Frequency distribution and histogram for a single variable.
2) Scatter diagram for two variables.

Calculate Descriptive Statistics

A statistic is a quantifiable characteristic of a sample. It is a known number that can differ from sample to sample. A descriptive statistic is a numerical measure that describes a characteristic of the sample data. There are two types of descriptive statistics.

1) Univariate descriptive statistics.
2) Bivariate descriptive statistics.

Univariate Descriptive Statistics

Univariate descriptive statistics describe the characteristics of the data for a single variable. The two characteristics of most interest are the following.

1) Center of the data (measured by the mean, median, or mode).
2) Dispersion of the data (measured by the range, variation, variance, standard deviation, or coefficient of variation).

Measures of Center of Data

The most often used measure of the center of data is the sample mean. Let X denote a variable. Let $x_t$ denote the value of X for the t-th unit in the sample. Let n denote the number of units/observations in the sample. The sample mean is

$$\bar{x} = \frac{1}{n} \sum_{t=1}^{n} x_t .$$

Measures of Dispersion of Data

The range is the difference between the maximum and minimum values of X.

The sample variation (also called the total sum-of-squares) is

$$TSS = \sum_{t=1}^{n} (x_t - \bar{x})^2 .$$

The sample variance is

$$s^2 = \frac{\sum_{t=1}^{n} (x_t - \bar{x})^2}{n - 1} = \frac{TSS}{n - 1} .$$

The sample standard deviation is

$$s = \sqrt{s^2} .$$

The sample coefficient of variation is

$$CV = \frac{s}{\bar{x}} \times 100 .$$

Comparison of Measures of Dispersion

The range is the least used measure because it uses only the two extreme values and so wastes information about dispersion. Larger values of variation ($TSS$), variance ($s^2$), and standard deviation ($s$) indicate more dispersion in the values of X about its mean. The advantage of the standard deviation is that it is measured in the same units as X and can be interpreted, roughly, as the typical deviation of X from its mean.
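As an illustration, the measures of center and dispersion defined above can be computed directly from their definitions. This is a minimal sketch: the function name `describe` and the five data values are made up for the example.

```python
import math


def describe(x):
    """Compute the univariate descriptive statistics defined in the text."""
    n = len(x)
    mean = sum(x) / n                          # sample mean
    tss = sum((v - mean) ** 2 for v in x)      # sample variation (TSS)
    variance = tss / (n - 1)                   # sample variance s^2
    sd = math.sqrt(variance)                   # sample standard deviation s
    return {
        "mean": mean,
        "range": max(x) - min(x),
        "tss": tss,
        "variance": variance,
        "sd": sd,
        "cv": sd / mean * 100,                 # coefficient of variation
    }


# Hypothetical sample of n = 5 observations on X.
x = [2.0, 4.0, 4.0, 6.0, 9.0]
stats = describe(x)

for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

For this sample the mean is 5, the deviations from the mean are -3, -1, -1, 1, 4, so TSS = 28 and the variance is 28 / 4 = 7.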
The major disadvantage of variation, variance, and standard deviation is that they are not unit-free measures. We cannot use them to compare dispersion for two or more variables measured in different units. The major advantage of the coefficient of variation is that it is unit-free, and can be used to make such a comparison.

Bivariate Descriptive Statistics

Bivariate descriptive statistics describe the characteristics of the data for two variables. The characteristic of most interest is the linear association between the two variables. The most often used measures of linear association are the following.

1) Covariation.
2) Covariance.
3) Correlation coefficient.

The basic idea for each of these measures is as follows. Two variables X and Y have a positive linear association if, when X is above (below) its mean, Y tends to be above (below) its mean. Two variables have a negative linear association if, when X is above (below) its mean, Y tends to be below (above) its mean. If two variables do not display either tendency, then they have no linear association. The stronger the tendency, the stronger the degree of linear association.

The sample covariation of X and Y is

$$\text{Sample Covariation} = \sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y}) .$$

The sample covariance is

$$s_{xy} = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}{n - 1} .$$

The sample correlation coefficient is

$$r_{xy} = \frac{s_{xy}}{s_x s_y} ,$$

where $s_x$ and $s_y$ are the sample standard deviations of X and Y.

Advantages and Disadvantages of Covariation, Covariance, Correlation Coefficient

The major advantage of these 3 measures is that they tell us something about the linear relationship between X and Y in the sample. If X and Y have a linear relationship in the sample, then they may be related in the population. These measures have 3 major disadvantages.

1) They tell us nothing about the causal relationship between X and Y in the sample, and therefore in the population. They can't tell us whether X has an independent causal effect on Y, and if so, the direction and size of the effect.

2) They do not measure nonlinear association between X and Y.
For example, if X and Y have a close U-shaped relationship, covariation, covariance, and correlation would be zero or close to zero. While it is true there is no linear association, there is a nonlinear association.
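The U-shaped case can be checked numerically. The sketch below computes the sample covariation, covariance, and correlation coefficient from their definitions; the function names and the data are illustrative assumptions, with X taking values placed symmetrically around zero so that Y = X squared traces an exact U shape.

```python
import math


def covariation(x, y):
    """Sum of products of deviations of X and Y from their sample means."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y))


def covariance(x, y):
    """Sample covariance: covariation divided by n - 1."""
    return covariation(x, y) / (len(x) - 1)


def correlation(x, y):
    """Sample correlation coefficient: s_xy / (s_x * s_y)."""
    s_x = math.sqrt(covariance(x, x))
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)


# Hypothetical data: X is symmetric around zero.
x = [-2, -1, 0, 1, 2]
y_line = [2 * v + 1 for v in x]   # exact linear relationship
y_u = [v ** 2 for v in x]         # exact U-shaped relationship

print("linear:   r =", correlation(x, y_line))  # approximately 1
print("U-shaped: r =", correlation(x, y_u))     # 0.0
```

In the U-shaped case Y is completely determined by X, yet the positive and negative products of deviations cancel exactly, so all three measures are zero.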