Ronald Heck
EDEP 768E: Seminar in Categorical Data Modeling (F2012)
rev. Aug. 24, 2012
Week 1: Class Notes
Categorical data refer to variables measured on a scale that simply classifies individuals into a limited number of groups (ethnicity, gender, political affiliation).
Categorical variables are typically discrete, which means they can take on only a finite number of values. We typically summarize this type of data as a frequency distribution, which summarizes the number of responses in each category, as well as the probability of occurrence of a particular response category and the percentage of responses in each category.
The probabilities in a table are proportions, computed by dividing the frequency of responses in a particular category by the total number of respondents. In the table below, the probability of being in the first category is 3/10 = 0.30, the probability of being in the second category is 2/10 = 0.20, and the probability of being in the third category is 5/10 = 0.50.
Group    Frequency    Percent    Valid Percent    Cumulative Percent
1.00             3       30.0             30.0                  30.0
2.00             2       20.0             20.0                  50.0
3.00             5       50.0             50.0                 100.0
Total           10      100.0            100.0
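As a quick check of this arithmetic, here is a minimal Python sketch (the group labels and counts are taken from the table above; the variable names are just for illustration):

    # Frequencies from the table above
    counts = {1: 3, 2: 2, 3: 5}
    total = sum(counts.values())        # 10 respondents

    for group, freq in counts.items():
        proportion = freq / total       # probability = frequency / total
        print(f"Group {group}: p = {proportion:.2f}")
    # Prints p = 0.30, 0.20, and 0.50, matching the table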
Scales of Measurement. Measurement is the application of a particular rule to assign numbers to objects or subjects for the purpose of differentiating between them on a particular attribute, such as a test or some type of psychological construct (e.g., motivation).
Four Characteristics:
1. Distinctiveness (if the numbers assigned to subjects or objects differ on the property
being measured, such as party affiliation). We often call this nominal.
2. Magnitude [if the different numbers that are assigned can be ordered in a meaningful
way such as from “not important” to “important” (1 to 5)]. Often referred to as
ordinal.
3. Equal intervals [if equivalent differences between two numbers that are assigned to
subjects or objects have an equivalent meaning (for example, a difference in
temperature from 60 to 50 is the same as from 70 to 60)]. This is often referred to as
an interval scale.
4. Absolute zero (if assigning a score of 0 to a person or object indicates the absence of
the attribute being measured). This is often referred to as a ratio scale (which is
uncommon in the social sciences). An example would be if we have no spelling errors
or no typing errors (see Azen & Walker, 2011 for further discussion). Keep in mind
this is not the same as a zero on a math test—which does not necessarily mean the
student has no math ability (maybe the test was too hard).
History of Categorical Methods
The discussion dates back to an early-1900s debate between Pearson and Yule. Pearson argued
that categorical variables were “proxies” for continuous variables. Yule argued categorical
variables were inherently discrete (Agresti, 1996). Pearson approached their analysis by
approximating their relationship on an underlying continuum. Yule developed a measure of
association that did not rely on approximating the underlying continuum. Both were partially
right. Certainly ordinal data behave something like interval/ratio data. A variable like pass/fail
seems to relate to an underlying continuous scale (with a cut score). Others are likely discrete
(the light is on or off).
Probability Distributions (we will deal primarily with four distributions)
Properties of a sampling distribution typically depend on an underlying distribution of the
random variable of interest. An example is the sampling distribution of the mean, from which
the probability of obtaining particular samples with particular mean values can be determined.
For continuous variables this is based on the normal distribution.
A random variable describes the possible outcomes that a particular variable may take on. The term conveys that the outcome was obtained from some underlying random process.
A probability distribution is a table or mathematical function that links the actual outcome
obtained (i.e., from an experiment or a random sample) to the probability of its occurrence.
For continuous variables, the assumption is that the values obtained are random observations that come from a normal distribution. When the outcome is categorical, it can come from distributions other than normal. Azen and Walker (2011) provide a nice introduction to examining these types of probability distributions.
1. Bernoulli Distribution (the simplest probability distribution for discrete variables). Here a discrete variable can take on only one of two values (0 = fail, 1 = pass). The probability is based on the proportion in the category of interest (typically coded 1) versus the probability of the other event occurring (coded 0).
The probability is often shown as π in the population and p in a sample. Consider the following frequency distribution for a dichotomous outcome (being admitted = 1, not admitted = 0).
Group    Frequency    Percent    Valid Percent    Cumulative Percent
.00              7       70.0             70.0                  70.0
1.00             3       30.0             30.0                 100.0
Total           10      100.0            100.0
The probability of the event coded 1 occurring (π) is then 3/10 or 0.30, while the probability of the event coded 0 occurring is 1 - π, or 0.70, in the sample above.
The mean (μ), or expected value, is π, and the variance can be calculated as π(1 - π), which in this case is 0.3(0.7) = 0.21. You can see that in this type of distribution the variance and the mean cannot be independent; that is, the variance is tied to the mean. This is one key difference between the mean and variance of discrete data versus continuous data (from a normal distribution), where in the latter case we can easily conceive of a distribution with a mean of 20 and an SD of 2 versus a distribution with the same mean of 20 but a standard deviation of 10.
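A minimal Python sketch of this point, using π = 0.30 from the admissions example above (the simulation size is arbitrary):

    import random

    pi = 0.30                      # probability of being admitted (coded 1)
    mean = pi                      # expected value of a Bernoulli variable is pi
    variance = pi * (1 - pi)       # 0.3 * 0.7 = 0.21; tied to the mean

    # Simulate 10,000 Bernoulli trials; the sample mean and variance
    # should land close to the theoretical values above
    draws = [1 if random.random() < pi else 0 for _ in range(10_000)]
    m = sum(draws) / len(draws)
    v = sum((d - m) ** 2 for d in draws) / (len(draws) - 1)
    print(mean, variance, m, v)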
2. Binomial Distribution. This is similar to a Bernoulli distribution except that where the
Bernoulli distribution deals with only one “trial” or event, the binomial distribution deals with
two possible outcomes but more than one trial (denoted as n). The classic case is flipping a coin.
In a binomial distribution, the trials are considered independent. So if you think about flipping a coin, over a large number of trials the probability of obtaining heads is 0.50. Of course, in the short run, like 10 flips, you might get 8 heads and only 2 tails. The mean of a binomial distribution is μ = nπ and the variance is nπ(1 - π).
In general, for a series of n independent trials, we can estimate the probability as

P(Y = k) = (n choose k) π^k (1 - π)^(n-k),

where (n choose k) refers to the number of ways that k objects (or individuals) can be selected from a total of n objects, and can be represented as follows:

(n choose k) = n!/[k!(n - k)!]
If in 3 trials (n = 3) we want the probability of three heads (k = 3), it would be the following:

P(Y = 3) = (3 choose 3) 0.5^3 (0.5)^(3-3) = [3!/(3!0!)] 0.5^3 (1) = 0.125

(which is the same as 1/2 x 1/2 x 1/2 = 1/8 = 0.125). (Note: 0! = 1.)
Let’s say now we want to know the probability of being proficient for two (k = 2) of three students randomly chosen (n = 3) in a particular class. Say the probability of being proficient is 0.7. The probability of being proficient for two of three randomly selected students is then

P(Y = 2) = (3 choose 2) 0.7^2 (0.3)^(3-2) = [3!/(2!(3 - 2)!)] (0.7^2)(0.3)^1 = 3(0.49)(0.3) = 0.441
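Both worked examples can be reproduced with a few lines of Python (a minimal sketch; binomial_pmf is just an illustrative name, and math.comb requires Python 3.8 or later):

    from math import comb

    def binomial_pmf(k, n, pi):
        # P(Y = k) for n independent trials with success probability pi
        return comb(n, k) * pi**k * (1 - pi)**(n - k)

    print(binomial_pmf(3, 3, 0.5))   # 0.125: three heads in three flips
    print(binomial_pmf(2, 3, 0.7))   # 0.441: two proficient of three students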
3. Multinomial Distribution. The multinomial distribution represents the multivariate extension
of the binomial distribution when there are I possible outcomes as opposed to only two
outcomes. In the multinomial distribution each trial results in one of I outcomes, where I is a fixed, finite number, and the probability of each possible outcome can be expressed by π1, π2, …, πI, such that the sum of all probabilities is 1. This then provides the probability of obtaining a specific outcome pattern across the I categories in n trials.
P(Y1 = k1, Y2 = k2, ..., YI = kI) = [n!/(k1! k2! ... kI!)] π1^k1 π2^k2 ... πI^kI

Note that the sum of the probabilities across the I categories is π1 + π2 + ... + πI = 1.
If we go back to our example at the top, suppose we select 3 students and want to know the probability that there will be one each in Group 1, Group 2, and Group 3. In this multinomial distribution, this is given by:

P(Y1 = 1, Y2 = 1, Y3 = 1) = [3!/(1!1!1!)] (0.3)^1 (0.2)^1 (0.5)^1 = 6(0.03) = 0.18
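The same calculation in Python (a minimal sketch; the function name is just for illustration):

    from math import factorial

    def multinomial_pmf(ks, pis):
        # Probability of observing counts ks across categories with probabilities pis
        n = sum(ks)
        coef = factorial(n)
        for k in ks:
            coef //= factorial(k)      # n! / (k1! k2! ... kI!)
        prob = float(coef)
        for k, pi in zip(ks, pis):
            prob *= pi**k
        return prob

    print(multinomial_pmf([1, 1, 1], [0.3, 0.2, 0.5]))   # 0.18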
4. Poisson Distribution. The Poisson distribution is similar to the binomial distribution in that both are used to model count data that vary randomly over time. As the number of trials increases, the two distributions tend to converge when the probability of success remains fixed. The major difference is that for the binomial distribution the number of observations (trials) is fixed. In contrast, for the Poisson distribution the number of observations is not fixed, but the period of time over which the observations (or trials) occur is fixed.
An example might be studying the number of courses a student fails (or passes) during ninth grade. To study this using the binomial distribution, the researcher needs to obtain the total number of courses undertaken during the year (n) and the number of courses failed during this period, since there are only two possible outcomes (fail or pass). This can be used to calculate the failure rate in the population or sample (π or p). For the Poisson distribution, we only need to know the mean number of courses failed during students’ freshman year of high school. This distribution is often used when the probability of the event of interest occurring is very small (e.g., failing a course).
For the Poisson distribution we can use the following: if λ is the expected number of events occurring in a fixed interval of time, then the probability of observing k occurrences within that interval is

P(Y = k) = e^(-λ) λ^k / k!
Let’s suppose that the expected number of course failures (λ) is 0.60 for freshman year. Then the probability of a student failing two courses can be estimated as:
P(Y = 2) = e^(-λ) λ^2 / 2! = e^(-0.60) (0.60)^2 / 2! = (0.55)(0.36) / 2(1) ≈ 0.099,

where e is approximately 2.71828.
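The same calculation in Python (a minimal sketch; the function name is just for illustration):

    from math import exp, factorial

    def poisson_pmf(k, lam):
        # P(Y = k) when lam events are expected in the fixed interval
        return exp(-lam) * lam**k / factorial(k)

    print(poisson_pmf(2, 0.60))   # about 0.099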
The mean and variance of the Poisson distribution are both λ (μ = σ^2 = λ). This suggests that as the event rate goes up, so does the variability of the event occurring. With real data, however, the variance is often greater than the mean, a situation referred to as overdispersion. Where this is likely to be a problem, we can switch to a negative binomial distribution, which adds an extra term to the model to account for the overdispersion.
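A quick way to spot overdispersion in practice is to compare the sample mean and variance of the counts. A minimal Python sketch (the counts below are hypothetical, purely for illustration):

    # Hypothetical course-failure counts for 10 students
    counts = [0, 0, 1, 0, 2, 0, 0, 5, 0, 1]
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    print(mean, variance)   # variance well above the mean suggests overdispersion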
To summarize:
1. Bernoulli distribution = probability of success in one trial (π = probability of success)
2. Binomial distribution = probability of k successes in n independent trials (π = probability of success)
3. Multinomial distribution = probability of ki successes in each of I categories in n independent trials (πi = probability of success in each of the I categories)
4. Poisson distribution = probability of k successes in a fixed time interval (λ = number of expected successes, or occurrences, in the fixed interval)
References
Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.
Azen, R., & Walker, C. (2011). Categorical data analysis for the behavioral and social sciences. New York: Routledge.