The probability framework for
statistical inference
Population
• The group or collection of entities of interest
• Here, "all possible" school districts
• "All possible" school districts means "all possible" circumstances that lead to specific values of STR (student-teacher ratio) and test scores
• The set of "all possible" school districts includes but is much larger than the set of 420 school districts observed in 1998.
• We will think of populations as infinitely large; the task is to make inferences from a sample from a large population
Random variable Y
• A random variable assigns a number to each member of the population in a particular way.
• The adjective "random" refers to the fact that the value the variable takes is determined by a drawing from the population.
• The district average test scores and the district STRs are random variables; their numerical values are determined once we choose a year/district to sample.
Characterizing Random Variables:
• Distribution
• Moments of the Distribution
• Joint Distributions
• Covariance; Correlation
• Conditional Moments of the Distribution
Population distribution of Y
• Discrete Random Variables: The probabilities of different values of Y that occur in the population
For ex. Pr[Y = 650], when Y is discrete. This probability is the proportion of elements of the population for which the value of Y is exactly equal to 650.
• Continuous Random Variables: The probabilities of sets of these values
For ex. Pr[Y ≤ 650], when Y is continuous. This probability is the proportion of elements of the population for which the value of Y is less than or equal to 650.
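These probabilities are just population proportions, which the following sketch computes for a small hypothetical population of test scores (illustrative numbers, not the actual 420-district data):

```python
# Hypothetical population of district test scores (assumed for illustration;
# the real population of "all possible" districts is far larger).
population = [630, 650, 650, 660, 650, 640, 620, 650, 670, 650]

# Discrete case: Pr[Y = 650] is the proportion of members with Y exactly 650.
p_eq_650 = sum(1 for y in population if y == 650) / len(population)

# Event of the form used for continuous Y: Pr[Y <= 650] is the proportion
# of members with Y at most 650.
p_le_650 = sum(1 for y in population if y <= 650) / len(population)

print(p_eq_650)  # 0.5
print(p_le_650)  # 0.8
```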
"Moments" of the population distribution
mean = expected value = E(Y) = μY
= long-run average value of Y over repeated realizations of Y
For a discrete random variable, the mean is a weighted average of each possible value of Y, where the weight assigned to a given value of Y is the probability of that Y.
For a continuous random variable, the mean is found by integrating over all possible values of Y, weighting each value of Y by the "density function" evaluated at that Y.
variance = E(Y – μY)² = σY²
= measure of the squared spread of the distribution
standard deviation = √variance = σY
Note that:
1. The variance is an expected value of a random variable.
2. The variance is in squared units of Y; the standard deviation is in the same units as Y.
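For a discrete Y, these moments can be computed directly as probability-weighted sums; a minimal sketch with an assumed (illustrative) distribution:

```python
import math

# Hypothetical discrete distribution of Y: value -> probability
# (assumed numbers for illustration; probabilities sum to 1).
pmf = {600: 0.2, 640: 0.5, 700: 0.3}

# Mean: weighted average of the possible values, weights = probabilities.
mean = sum(y * p for y, p in pmf.items())

# Variance: probability-weighted average of squared deviations from the mean.
var = sum((y - mean) ** 2 * p for y, p in pmf.items())

# Standard deviation: square root of the variance, in the same units as Y.
sd = math.sqrt(var)
```

With these numbers, mean = 650 and var = 1300, so sd ≈ 36.06, in the same units as Y.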
Joint distributions
• Corresponding to each member of the population there may be more than one value assigned. E.g., test score (Y) and STR (X).
• There is a probability distribution for Y (from which we can derive the mean and variance of Y) and a probability distribution for X (from which we can derive the mean and variance of X).
• The joint probability distribution for Y and X provides the probability that the random variables Y and X take on the values y and x, respectively (if Y and X are discrete random variables), i.e., Prob(Y = y and X = x), or the probability that the random variables Y and X lie in some subset of R² (if Y and X are continuous random variables), e.g., Prob(Y ≤ y and X ≤ x).
• For example, what is the probability of drawing a district from the population for which the average test score is 650 and the STR is 20?
• The marginal distributions of Y and X are simply the individual probability distributions of Y and X, which can be recovered from their joint distribution (although the reverse isn't true).
• The random variables Y and X are independent if (and only if) their joint distribution factors into the product of their marginal distributions, i.e.,
Prob(Y = y and X = x) = Prob(Y = y) × Prob(X = x)
Prob(Y ≤ y and X ≤ x) = Prob(Y ≤ y) × Prob(X ≤ x)
for all x and y.
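A small sketch of these ideas for discrete X and Y, using an assumed joint probability table (illustrative numbers): the marginals are recovered by summing the joint over the other variable, and independence is checked by the factorization condition.

```python
# Hypothetical joint distribution: joint[(x, y)] = Prob(X=x and Y=y),
# where X is an STR value and Y a test-score value (assumed for illustration).
joint = {
    (20, 650): 0.30, (20, 700): 0.20,
    (25, 650): 0.35, (25, 700): 0.15,
}

xs = {x for x, _ in joint}
ys = {y for _, y in joint}

# Marginal distributions: sum the joint over the other variable.
marg_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in xs}
marg_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in ys}

# X and Y are independent iff the joint factors into the product of marginals
# at every (x, y) pair.
independent = all(
    abs(joint[(x, y)] - marg_x[x] * marg_y[y]) < 1e-12 for x in xs for y in ys
)
```

Here Prob(X=20, Y=650) = 0.30 but Prob(X=20)·Prob(Y=650) = 0.5 × 0.65 = 0.325, so X and Y are not independent; note that the marginals alone could never reveal this, which is why the joint cannot be recovered from them.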
The covariance between r.v.'s X and Y is
cov(X,Y) = E[(X – μX)(Y – μY)] = σXY
• cov(X,Y) > 0: X and Y are positively related; when X is above (below) its mean, Y tends to be above (below) its mean. cov(X,Y) < 0: the reverse. (We hypothesize that the random variables test score and STR have a negative covariance.)
• If X and Y are independently distributed, then cov(X,Y) = 0 (but not vice versa!)
The correlation coefficient is defined in terms of the covariance:
corr(X,Y) = cov(X,Y) / √(var(X)·var(Y)) = σXY / (σX·σY) = rXY
• –1 ≤ corr(X,Y) ≤ 1
• corr(X,Y) = 1 means perfect positive linear association
• corr(X,Y) = –1 means perfect negative linear association
• corr(X,Y) = 0 means no linear association
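These population formulas can be applied directly to a small assumed population of (STR, test score) pairs; the illustrative numbers below are chosen so the linear association is perfectly negative, which makes the result easy to check.

```python
import math

# Hypothetical paired population values: X = STR, Y = test score
# (assumed numbers for illustration, with an exact linear relationship).
X = [15, 17, 19, 21, 23]
Y = [680, 670, 660, 650, 640]

n = len(X)
mx = sum(X) / n  # mu_X
my = sum(Y) / n  # mu_Y

# Population covariance: average of (X - mu_X)(Y - mu_Y).
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n

# Correlation: covariance scaled by the standard deviations, so -1 <= r <= 1
# and the result is unit-free.
sx = math.sqrt(sum((x - mx) ** 2 for x in X) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in Y) / n)
corr = cov / (sx * sy)
```

For these numbers cov(X,Y) = –40 (in STR-units × score-units) and corr(X,Y) = –1: a perfect negative linear association, as expected since Y was constructed as an exact decreasing linear function of X.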
Conditional distributions
• The distribution of Y, given value(s) of some other random variable, X
• So, conditional distributions are distributions of "subpopulations," created from the original population according to some criterion.
• Ex: the distribution of test scores, given that STR < 20. (Divide the population into two subpopulations according to their STRs. Then consider the distribution of test scores for each subpopulation.)
Moments of conditional distributions
• conditional mean = mean of conditional distribution
= E(Y|X = x) (important notation)
• conditional variance = variance of conditional distribution
• Example: E(Test scores|STR < 20), the mean of test scores for districts with small class sizes; Var(Test scores|STR < 20), the variance of test scores for districts with small class sizes.
The difference in means is the difference between the means of two conditional distributions:
Δ = E(Test scores|STR < 20) – E(Test scores|STR ≥ 20)
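The difference in conditional means Δ can be sketched directly: split an assumed population of districts by the conditioning criterion and subtract the two group means (illustrative numbers only).

```python
# Hypothetical district data: (STR, test score) pairs, assumed for illustration.
districts = [(18, 665), (19, 660), (21, 640), (22, 645), (17, 670), (24, 635)]

# Split the population into two subpopulations by the conditioning criterion.
small = [score for str_, score in districts if str_ < 20]   # STR < 20
large = [score for str_, score in districts if str_ >= 20]  # STR >= 20

# Conditional means are just group means; Delta is their difference.
mean_small = sum(small) / len(small)  # E(Test scores | STR < 20)
mean_large = sum(large) / len(large)  # E(Test scores | STR >= 20)
delta = mean_small - mean_large
```

With these numbers, E(Test scores|STR < 20) = 665 and E(Test scores|STR ≥ 20) = 640, so Δ = 25.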
Other examples of conditional means:
• Wages of all female workers (Y = wages, X = gender)
• One-year mortality rate of those given an experimental treatment (Y = live/die; X = treated/not treated)
The conditional mean is a new term for a familiar idea: the group mean.
Inference about means, conditional means, and differences in conditional means
We would like to know Δ (test score gap; gender wage gap; effect of experimental treatment), but we don't know it. (We don't know it? Didn't we calculate Δ last week?)
Therefore we must collect and use data by sampling from the population, permitting us to make statistical inferences about Δ.
• Experimental data
• Observational data
Simple random sampling
• Choose an individual (district, entity) at random from the population
Randomness and data
• Prior to sample selection, the value of Y is random because the individual selected is random
• Once the individual is selected and the value of Y is observed, then Y is just a number – not random
• The data set is (Y1, Y2, …, Yn), where Yi = value of Y for the ith individual (district, entity) sampled.
• Thus, the data set is made up of realized values of n random variables.
Implications of simple random sampling
Because individuals #1 and #2 are selected at random, the value of Y1 has no information content for Y2. Thus:
• Y1 and Y2 are independently distributed
• Y1 and Y2 come from the same distribution; that is, Y1 and Y2 are identically distributed
• That is, a consequence of simple random sampling is that Y1 and Y2 are independently and identically distributed (i.i.d.)
• More generally, under simple random sampling, {Yi}, i = 1, …, n, are i.i.d.
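A minimal simulation of this sampling scheme, treating the population as effectively infinite (so each draw is made independently from the same distribution, giving i.i.d. Y1, …, Yn); the population values here are assumed for illustration:

```python
import random

random.seed(0)  # fix the seed so the draws are reproducible

# Hypothetical population of district test scores (assumed for illustration).
# Treating the population as infinitely large, each draw is independent of
# the others and comes from the same distribution: the Y_i are i.i.d.
population = list(range(600, 701))

n = 5
sample = [random.choice(population) for _ in range(n)]

# Before the draws, each Y_i was random; these realized values are just numbers.
print(sample)
```

Each `random.choice` call is unaffected by the earlier draws (independence) and uses the same population (identical distribution), which is exactly the i.i.d. property the slide states.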