Basics
We often denote our sample space with S and an event (or subset) within S as A. For any event,

P(A) ≥ 0,

and we also require

P(S) = 1,

which is equivalent to saying that something must happen. A side effect of this is that

P(A) + P(A^c) = 1,

which, if we push it to arbitrary sets, says that

P(A) = ∑_{k=1}^{n} P(Ak)

if A = A1 ∪ . . . ∪ An and Ai ∩ Aj = ∅ for i ≠ j.

Many ideas from set theory carry over to probability, such as

if A ⊆ B, then P(A) ≤ P(B),

or

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Extending this last idea further gives us the inclusion-exclusion principle: P(A1 ∪ . . . ∪ An) can be calculated by finding all the intersections between those sets, adding all the "odd intersections" and subtracting all the "even intersections."

Combinatorics

Let E1, . . . , Ek be sets with n1, . . . , nk elements respectively. Then there are n1 n2 . . . nk ways by which one element from each set may be chosen.

A set with n elements has 2^n subsets.

An ordered arrangement of r objects from a set of n distinguishable objects is called a permutation. The number of r-element permutations is

nPr = n!/(n − r)!.

The number of distinguishable permutations of n objects with k different types (n1 of the first type, . . . , nk of the kth type) is

n!/(n1! n2! . . . nk!),

if n1 + . . . + nk = n.

An unordered arrangement of r objects from a set A containing n objects is called a combination. The number of r-element combinations is

C(n, r) = n!/(r!(n − r)!).

The binomial expansion is defined for integer n ≥ 0 as

(x + y)^n = ∑_{i=0}^{n} C(n, i) x^{n−i} y^i.

Conditional Probability

The conditional probability of A given B is

P(A|B) = P(A ∩ B)/P(B),

for P(B) > 0. The standard axioms can be proved as theorems for conditional probability distributions. This formula can be reversed to give

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).

This idea allows us to write

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c).

If we define a partition of our sample space S as a set {B1, . . . , Bn} where these events are mutually exclusive and B1 ∪ . . . ∪ Bn = S, then the law of total probability is a natural extension of this concept:

P(A) = ∑_{i=1}^{n} P(A|Bi)P(Bi).

Plugging this in to the conditional probability definition gives us Bayes' Theorem:

P(Bk|A) = P(A|Bk)P(Bk) / ∑_{i=1}^{n} P(A|Bi)P(Bi).
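The law of total probability and Bayes' Theorem are easy to sanity-check numerically. Below is a minimal Python sketch; the partition probabilities and likelihoods are hypothetical numbers chosen only for illustration.

```python
# Law of total probability and Bayes' Theorem for a partition {B1, B2, B3} of S.
# The priors P(Bi) and likelihoods P(A|Bi) below are hypothetical illustrative values.
priors = [0.5, 0.3, 0.2]        # P(B1), P(B2), P(B3)
likelihoods = [0.9, 0.5, 0.1]   # P(A|B1), P(A|B2), P(A|B3)

# Law of total probability: P(A) = sum over i of P(A|Bi) P(Bi)
p_A = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' Theorem: P(Bk|A) = P(A|Bk) P(Bk) / P(A)
posteriors = [l * p / p_A for l, p in zip(likelihoods, priors)]

print(p_A)                      # 0.62
print(posteriors)               # roughly [0.726, 0.242, 0.032]
```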
The event A is considered independent from the event B if P(A|B) = P(A). These two events are declared independent if

P(A ∩ B) = P(A)P(B).

We know that if A and B are independent, so are A and B^c. To extend this to more than two sets, all subsets must be declared independent by the same decomposability of probability.

Basic Random Variables

If an experiment has sample space S, a random variable is a function X : S → R if for any I ⊆ R,

{s : X(s) ∈ I} is an event.

The function FX defined on R is the cumulative distribution function, or CDF, of random variable X if FX(t) = P(X ≤ t).

Every CDF must be:

• nondecreasing,
• lim_{t→∞} F(t) = 1,
• lim_{t→−∞} F(t) = 0,
• right continuous.

If X can take countably many values, it is a discrete random variable. The probability mass function p of a discrete RV satisfies:

• p(x) = 0 if X cannot take the value x,
• p(x) = P(X = x) if X can take the value x,
• ∑_{i=1}^{∞} p(xi) = 1 for values xi that X may take.

The expected value of a discrete RV X is

E(X) = ∑_{x∈A} x p(x),

where A is the set of values X may take. We also know that

E(h(X)) = ∑_{x∈A} h(x) p(x).

This allows us to compute the variance

Var(X) = E[(X − E(X))²] = E(X²) − E(X)².

We also define the standard deviation

σX = √Var(X).

The term i.i.d. stands for "independent and identically distributed."

Special Discrete Random Variables

A Bernoulli RV X ∼ Bernoulli(p) takes value 1 with probability p and 0 with probability 1 − p. E(X) = p and Var(X) = p(1 − p).

A Binomial RV X ∼ Binom(n, p) is the number of successes in n i.i.d. Bernoulli trials with parameter p. P(X = k) = C(n, k) p^k (1 − p)^{n−k}, E(X) = np and Var(X) = np(1 − p).

A Poisson RV X ∼ Poisson(λ) is the number of successes over a given time with rate of success λ. P(X = k) = e^{−λ} λ^k/k!, and E(X) = Var(X) = λ. If Y ∼ Binom(n, p) and n is very large, but 0.1 < np < 10, then Y ∼ Poisson(np) is a good approximation.

A Geometric RV X ∼ Geom(p) is the number of trials until the first success, if p is the probability of success. P(X = k) = (1 − p)^{k−1} p, E(X) = 1/p and Var(X) = (1 − p)/p². Geometric RVs satisfy a memoryless property, P(X > n + m | X > m) = P(X > n).

A Negative Binomial RV X ∼ NBinom(p, r) is the number of trials required to obtain r successes, where p is the probability of success. P(X = k) = C(k − 1, r − 1) p^r (1 − p)^{k−r}, E(X) = r/p and Var(X) = r(1 − p)/p².

A Hypergeometric RV X ∼ Hyper(N, D, n) is the number of broken units if n units are chosen, without replacement, from N units of which D are broken. P(X = k) = C(D, k) C(N − D, n − k)/C(N, n), E(X) = nD/N and Var(X) = [nD(N − D)/N²][(N − n)/(N − 1)].
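As a quick illustration of the Poisson approximation to the Binomial described above, here is a small Python sketch using only the standard library; the values of n and p are hypothetical, chosen so that np is moderate.

```python
from math import comb, exp, factorial

# Binomial pmf: P(X = k) = C(n, k) p^k (1 - p)^(n - k)
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson pmf: P(X = k) = e^(-lam) lam^k / k!
def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

# Illustrative values: n large, np = 3 (so 0.1 < np < 10 as above).
n, p = 1000, 0.003
for k in range(6):
    print(k, round(binom_pmf(k, n, p), 5), round(poisson_pmf(k, n * p), 5))
```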
Continuous Random Variables
A continuous RV X has a CDF FX which is continuous. By differentiating, fX = FX′ gives us the probability density function of X, wherever FX is differentiable. This allows for

P(a < X < b) = ∫_a^b fX(t) dt = F(b) − F(a).

Our definition of CDF therefore requires that

∫_{−∞}^{∞} fX(t) dt = 1,

that fX(t) ≥ 0, and that FX is nondecreasing.

Note that P(a < X < b) = P(a ≤ X ≤ b) if X is a continuous RV.

The expected value of a continuous RV X is

E(X) = ∫_{−∞}^{∞} x fX(x) dx.

As before, we know that

E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx.

This allows us to compute

Var(X) = E[(X − E(X))²] = E(X²) − E(X)².

As before, σX = √Var(X) is the standard deviation.

Common Continuous Distributions

For a continuous RV X, there are several distributions which occur often enough to warrant names.

The Uniform distribution represents a point being selected randomly from an interval without any preference between points. If X ∼ Unif(a, b), for a < b, then

fX(t) = 1/(b − a),  t ∈ [a, b).

This tells us that

E(X) = (a + b)/2,  Var(X) = (b − a)²/12.

The most important distribution is the Normal or Gaussian distribution. If X ∼ N(µ, σ), for σ > 0, then

fX(t) = (1/(σ√(2π))) e^{−(t−µ)²/2σ²},  t ∈ R.

We know

E(X) = µ,  Var(X) = σ².

As a special case, we may use the letter Z ∼ N(0, 1) to denote the Standard Normal distribution. Sometimes we may use Φ(t) = FZ(t), though I don't use that often. As a symmetric density (about t = 0), we know

FZ(−t) = 1 − FZ(t).

Things such as IQ, height, GPA or other such quantities may be normally distributed.

The exponential distribution is also important, and allows us to describe the time until some event occurs. If X ∼ Exp(λ), for λ > 0, then

fX(t) = λe^{−λt},  t ∈ [0, ∞).

We know that

E(X) = 1/λ,  Var(X) = 1/λ².

An important feature of the exponential distribution is its memoryless property:

P(X > s + t | X > t) = P(X > s).

If we want to consider the time until r events occur, we will instead need the Gamma distribution. If X ∼ Gamma(r, λ), for r, λ > 0, then

fX(t) = (λ/Γ(r)) e^{−λt} (λt)^{r−1},  t ∈ [0, ∞),

where

Γ(r) = ∫_0^∞ t^{r−1} e^{−t} dt, which equals (r − 1)! for integer r.

We know that

E(X) = r/λ,  Var(X) = r/λ².
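Returning to the Normal distribution above, probabilities of the form P(a < X < b) can be computed through the standard normal CDF Φ. A short Python sketch follows; the mean, standard deviation and interval are arbitrary illustrative values.

```python
from math import erf, sqrt

# Standard normal CDF: Phi(t) = F_Z(t) = 0.5 * (1 + erf(t / sqrt(2)))
def Phi(t):
    return 0.5 * (1 + erf(t / sqrt(2)))

# For X ~ N(mu, sigma): P(a < X < b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma).
mu, sigma = 100, 15            # illustrative values on an IQ-style scale
a, b = 85, 115
print(Phi((b - mu) / sigma) - Phi((a - mu) / sigma))   # about 0.683

# Symmetry of the standard normal density: F_Z(-t) = 1 - F_Z(t)
print(Phi(-1.5), 1 - Phi(1.5))
```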
Bivariate Distributions
To study how two discrete RVs interact together, we need to consider their joint probability mass function:

p(x, y) = P(X = x, Y = y),  ∑_{x∈A} ∑_{y∈B} p(x, y) = 1,

where A and B are the sets of points at which X and Y may take value.

The marginal mass function pX(x) is defined as

pX(x) = ∑_{y∈B} p(x, y),

with pY(y) defined analogously. Using these marginal mass functions we can compute

E(X) = ∑_{x∈A} x pX(x),  E(Y) = ∑_{y∈B} y pY(y).

We also know that

E(h(X, Y)) = ∑_{x∈A} ∑_{y∈B} h(x, y) p(x, y).

The same ideas are all true in the continuous setting: a joint probability density function f(x, y) is a function such that, for X and Y RVs and a region R ⊆ R²,

P((X, Y) ∈ R) = ∫∫_R f(x, y) dx dy.

We know that

∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

We define the marginal density functions

fX(x) = ∫_{−∞}^{∞} f(x, y) dy,  fY(y) = ∫_{−∞}^{∞} f(x, y) dx,

and with those we can define E(X), E(Y).

Two random variables are called independent if, for arbitrary A, B ⊆ R,

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).

This in turn implies that

F(x, y) = FX(x)FY(y),

which has, respectively, the discrete and continuous results

p(x, y) = pX(x)pY(y),  f(x, y) = fX(x)fY(y).

An important result of this is that, if X and Y are independent RVs,

E[g(X)h(Y)] = E[g(X)]E[h(Y)].

The Covariance between RVs X and Y is

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y).

Note that Cov(X, X) = Var(X), and if X and Y are independent then Cov(X, Y) = 0.

Central Limit Theorem

A random sample is a collection of i.i.d. random variables X1, . . . , Xn. After observing these variables, X1 = x1 and so on, these statistics can be used to study the population of the Xi.

The Central Limit Theorem states that if X1, X2, . . . , Xn is a sequence of i.i.d. RVs with E(Xi) = µ and Var(Xi) = σ², then

X̄n = (1/n)(X1 + . . . + Xn)

is a random variable whose distribution approaches a normal distribution with mean µ and variance σ²/n for large n.

Using the central limit theorem, we can approximate our population mean E[Xi] = µ using the sample mean X̄n = x̄. For increasingly large samples, Var(X̄n) decreases, meaning that the sample mean x̄ is an increasingly good approximation of the true mean µ.
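The behaviour of X̄n can be seen in a small simulation. The sketch below draws i.i.d. Unif(0, 1) samples (so µ = 1/2 and σ² = 1/12) and compares the observed variance of the sample mean with σ²/n; the sample sizes and replication count are arbitrary illustrative choices.

```python
import random
import statistics

random.seed(0)

# Sample mean of n i.i.d. Unif(0, 1) draws; mu = 1/2, sigma^2 = 1/12.
def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

# The CLT says X-bar_n is approximately normal with mean mu and variance sigma^2 / n,
# so the observed variance of the sample means should shrink like 1 / (12 n).
for n in (10, 100, 1000):
    means = [sample_mean(n) for _ in range(2000)]
    print(n, round(statistics.mean(means), 4),
          round(statistics.variance(means), 6),
          round(1 / (12 * n), 6))
```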