Ronald Heck, EDEP 768E: Seminar in Categorical Data Modeling (F2012), rev. Aug. 24, 2012

Week 1: Class Notes

Categorical data refer to a variable measured on a scale that simply classifies individuals into a limited number of groups (e.g., ethnicity, gender, political affiliation). Categorical variables are typically discrete, which means they can take on only a finite number of values. We typically summarize this type of data as a frequency distribution, which gives the number of responses in each category, as well as the probability of occurrence of a particular response category and the percentage of responses in each category. The probabilities in a table are proportions, computed by dividing the frequency of responses in a particular category by the total number of respondents. So the probability of being in the first category is 3/10 = 0.30, the probability of being in the second category is 2/10 = 0.20, and the probability of being in the third category is 5/10 = 0.50.

Group    Frequency   Percent   Valid Percent   Cumulative Percent
1.00         3         30.0        30.0              30.0
2.00         2         20.0        20.0              50.0
3.00         5         50.0        50.0             100.0
Total       10        100.0       100.0

Scales of Measurement. Measurement is applying a particular rule to assign numbers to objects or subjects for the purpose of differentiating between them on a particular attribute, such as a test or some type of psychological construct (e.g., motivation).

Four characteristics:
1. Distinctiveness (the numbers assigned to subjects or objects differ on the property being assessed, such as party affiliation). We often call this nominal.
2. Magnitude [the different numbers that are assigned can be ordered in a meaningful way, such as from "not important" to "important" (1 to 5)]. Often referred to as ordinal.
3. Equal intervals [equivalent differences between the numbers assigned to subjects or objects have an equivalent meaning (e.g., a difference in temperature from 60 to 50 is the same as from 70 to 60)].
This is often referred to as an interval scale.
4. Absolute zero (assigning a score of 0 to a person or object indicates the absence of the attribute being measured). This is often referred to as a ratio scale, which is uncommon in the social sciences. An example would be having no spelling errors or no typing errors (see Azen & Walker, 2011, for further discussion). Keep in mind this is not the same as a zero on a math test, which does not necessarily mean the student has no math ability (maybe the test was too hard).

History of Categorical Methods

The discussion goes back to the early 1900s and a debate between Pearson and Yule. Pearson argued that categorical variables were "proxies" for continuous variables; Yule argued that categorical variables were inherently discrete (Agresti, 1996). Pearson approached their analysis by approximating the relationship on an underlying continuum, while Yule developed a measure of association that did not rely on approximating an underlying continuum. Both were partially right. Certainly ordinal data behave something like interval/ratio data, and a variable like pass/fail seems to relate to an underlying continuous scale (with a cut score). Other variables are likely discrete (the light is on or off).

Probability Distributions (we will deal primarily with four distributions)

Properties of a sampling distribution typically depend on an underlying distribution of the random variable of interest. An example is the sampling distribution of the mean, from which the probability of obtaining particular samples with particular mean values can be determined. For continuous variables this is based on the normal distribution. The term random variable describes the possible outcomes that a particular variable may take on; it conveys the fact that the outcome was obtained from some underlying random process.
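The proportions in the frequency table at the start of these notes can be reproduced with a few lines of code. A minimal sketch in Python (the variable names are illustrative, not from the notes):

```python
# Frequencies from the three-group table above: 3, 2, and 5 responses out of 10.
freqs = {"1.00": 3, "2.00": 2, "3.00": 5}
n = sum(freqs.values())  # total number of respondents (10)

# Proportion = category frequency / total; cumulative percent sums the percents.
cumulative = 0.0
for group, f in freqs.items():
    p = f / n
    cumulative += 100 * p
    print(f"Group {group}: p = {p:.2f}, percent = {100*p:.1f}, cumulative = {cumulative:.1f}")
```

Running this reproduces the proportions 0.30, 0.20, and 0.50 and the cumulative percents 30.0, 50.0, and 100.0.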
A probability distribution is a table or mathematical function that links the actual outcome obtained (i.e., from an experiment or a random sample) to the probability of its occurrence. For continuous variables, the assumption is that the values obtained are random observations that come from a normal distribution. When the outcome is categorical, it can come from distributions other than normal. Azen and Walker (2011) provide a nice introduction to examining these types of probability distributions.

1. Bernoulli Distribution. This is the simplest probability distribution for discrete variables: the variable can take on only one of two values (0 = fail, 1 = pass). The probability is based on the proportion in the category of interest (typically coded 1) versus the probability of the other event occurring (coded 0). The probability is often denoted π in the population and p in a sample. Consider the following sample distribution of a dichotomous outcome (admitted = 1, not admitted = 0).

Group    Frequency   Percent   Valid Percent   Cumulative Percent
.00          7         70.0        70.0              70.0
1.00         3         30.0        30.0             100.0
Total       10        100.0       100.0

The probability of the event coded 1 occurring (π) is then 3/10 = 0.30, while the probability of the event coded 0 occurring is 1 - π, or 0.70, in the sample above. The mean (μ), or expected value, is π, and the variance can be calculated as π(1 - π), which in this case is 0.3(0.7) = 0.21. You can see that in this type of distribution the variance and the mean cannot be independent; the variance is tied to the mean. This is one key difference between the mean and variance of discrete data versus continuous data (from a normal distribution), where in the latter case we can easily conceive of a sample distribution with a mean of 20 and an SD of 2 versus a distribution with the same mean of 20 but a standard deviation of 10.

2. Binomial Distribution.
This is similar to a Bernoulli distribution except that where the Bernoulli distribution deals with only one "trial" or event, the binomial distribution deals with two possible outcomes over more than one trial (the number of trials is denoted n). The classic case is flipping a coin. In a binomial distribution, the trials are considered independent. So if you think about flipping a coin, over a large number of trials the probability of obtaining heads is 0.50; of course, in the short run, like 10 flips, you might get 8 heads and only 2 tails. The mean of a binomial distribution is μ = nπ and the variance is nπ(1 - π). In general, for a series of n independent trials, we can estimate the probability of k successes as

P(Y = k) = (n choose k) π^k (1 - π)^(n-k),

where (n choose k) refers to the number of ways that k objects (or individuals) can be selected from a total of n objects, which can be represented as follows:

(n choose k) = n! / [k!(n - k)!]

If in 3 trials (n = 3) we want the probability of three heads (k = 3), it would be the following:

P(Y = 3) = (3 choose 3) 0.5^3 (0.5)^(3-3) = [3!/(3! 0!)] 0.5^3 (1) = 0.125

(which is the same as 1/2 x 1/2 x 1/2 = 1/8 = 0.125). (Note: 0! = 1.)

Let's say now we want to know the probability of being proficient for two (k = 2) of three students randomly chosen (n = 3) in a particular class, where the probability of being proficient is 0.7. The probability of being proficient for two of the three randomly selected students is then

P(Y = 2) = (3 choose 2) 0.7^2 (0.3)^(3-2) = [3!/(2!(3-2)!)] (0.7^2)(0.3)^1 = 3(0.7^2)(0.3) = 0.441

3. Multinomial Distribution. The multinomial distribution represents the multivariate extension of the binomial distribution when there are I possible outcomes as opposed to only two. In the multinomial distribution each trial results in one of I outcomes, where I is a fixed finite number, and the probability of each possible outcome can be expressed by π1, π2, …, πI such that the sum of all the probabilities is 1.
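Returning to the binomial distribution for a moment, the two worked examples above (three heads in three fair-coin flips, and two proficient students out of three when π = 0.7) can be checked numerically. A minimal sketch using only the Python standard library (the function name is illustrative):

```python
from math import comb

def binomial_pmf(k, n, pi):
    """P(Y = k): probability of k successes in n independent trials,
    each with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

# Three heads in three flips of a fair coin: (3 choose 3) * 0.5^3 = 0.125
p_heads = binomial_pmf(3, 3, 0.5)

# Two proficient students out of three when pi = 0.7:
# (3 choose 2) * 0.7^2 * 0.3 = 0.441
p_prof = binomial_pmf(2, 3, 0.7)
print(p_heads, p_prof)
```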
This then provides the probability of obtaining a specific outcome pattern across the I categories in n trials:

P(Y1 = k1, Y2 = k2, ..., YI = kI) = [n! / (k1! k2! ... kI!)] π1^k1 π2^k2 ... πI^kI

Note that the sum of the category probabilities is Σ πi = 1. If we go back to our example at the top, suppose we select 3 students and want to know the probability that there will be one each in Group 1, Group 2, and Group 3. Under the multinomial distribution, this is given by:

P(Y1 = 1, Y2 = 1, Y3 = 1) = [3!/(1! 1! 1!)] (0.3)^1 (0.2)^1 (0.5)^1 = 6(0.03) = 0.18

4. Poisson Distribution. The Poisson distribution is similar to the binomial distribution in that both are used to model count data that vary randomly over time. As the number of trials increases, the two distributions tend to converge when the probability of success remains fixed. The major difference is that for the binomial distribution the number of observations (trials) is fixed, whereas for the Poisson distribution the number of observations is not fixed but the period of time over which the observations (or trials) occur is fixed. An example might be studying the number of courses a student fails (or passes) during ninth grade. To study this using the binomial distribution, the researcher needs to obtain the total number of courses undertaken during the year (n) and the number of courses failed during this period, because there are only two possible outcomes (fail or pass). This can be used to calculate the failure rate in the population or sample (π or p). The Poisson distribution, in contrast, only needs the mean number of courses failed during students' freshman year. This distribution is often used when the probability of the event of interest occurring is very small (e.g., failing a course).
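The multinomial example above (one student from each of the three groups, with probabilities 0.3, 0.2, and 0.5) can also be sketched in Python; a minimal version with illustrative names:

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of observing the given count in each category
    over n = sum(counts) independent trials."""
    n = sum(counts)
    coef = factorial(n)          # multinomial coefficient n!/(k1!...kI!)
    for k in counts:
        coef //= factorial(k)
    p = float(coef)
    for k, pi in zip(counts, probs):
        p *= pi**k               # multiply in each category's pi^ki
    return p

# One student in each of Groups 1-3: 3!/(1!1!1!) * 0.3 * 0.2 * 0.5 = 0.18
p_one_each = multinomial_pmf([1, 1, 1], [0.3, 0.2, 0.5])
print(p_one_each)
```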
For the Poisson distribution, if λ is the expected number of events in a fixed interval of time, then the probability of observing k occurrences within that interval is as follows:

P(Y = k) = e^(-λ) λ^k / k!

Let's suppose that the expected number of course failures (λ) is 0.60 for freshman year. Then the probability of a student failing two courses can be estimated as:

P(Y = 2) = e^(-0.60) 0.60^2 / 2! = (0.55)(0.36) / 2(1) ≈ 0.099,

where e is approximately 2.71828. The mean and variance are both summarized as λ, which suggests that as the event rate goes up, so does the variability of the event occurring. With real data, however, the variance is often greater than the mean, a situation referred to as overdispersion. Where this is likely to be a problem, we can switch to a negative binomial distribution, which has the effect of adding an extra term to the model to account for the overdispersion.

To summarize:
1. Bernoulli distribution = probability of success in one trial (π = probability of success)
2. Binomial distribution = probability of k successes in n independent trials (π = probability of success)
3. Multinomial distribution = probability of ki successes in each of I categories in n independent trials (πi = probability of success in category i)
4. Poisson distribution = probability of k successes in a fixed time interval (λ = number of expected successes, or occurrences, in the fixed interval)

References

Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.
Azen, R., & Walker, C. (2011). Categorical data analysis for the behavioral and social sciences. New York: Routledge.
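As a final numeric check, the Poisson course-failure example above (expected failures 0.60 per year, probability of exactly two failures roughly 0.099) can be reproduced with a short sketch; the function name is illustrative:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(Y = k): probability of k events in a fixed interval
    when the expected count in that interval is lam."""
    return exp(-lam) * lam**k / factorial(k)

# P(Y = 2) with expected failures lam = 0.60:
# e^(-0.6) * 0.36 / 2 = 0.0988..., i.e., about 0.099
p_two_failures = poisson_pmf(2, 0.60)
print(round(p_two_failures, 3))
```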