1. Fundamentals of Probability Theory and Mathematical Statistics

Statistics is the methodology of drawing conclusions from observed experimental data in the presence of uncertainty. Statistical decision and learning based on mathematical statistics form the foundation of state-of-the-art information and communication technology (ICT), and of course of statistical communication theory. We shall keep the mathematical machinery (such as measure theory) to a minimum while introducing probability theory and mathematical statistics.

1.1 Probability and Random Variables

Probability has long been used to model uncertainty, with gambling as one major early application. Classical probability proceeds by intuition. Following progress in modern mathematical analysis, axiomatic probability theory was developed and is now applied throughout engineering and computer science, as well as in the natural sciences, biology and ecology, and the social sciences. In this section we summarize the fundamental probability theory that is useful for statistical decision, inference, and learning.

To construct probability, our first encounter is with the outcomes of an experiment. The space of elementary outcomes, a non-empty set denoted by $\Omega$, has elements $\omega \in \Omega$ called elementary outcomes; $\Omega$ is called the sample space.

Example: We roll a die; there are 6 outcomes, and $\Omega = \{1, 2, 3, 4, 5, 6\}$.
Definition: A collection $\mathcal{A}$ of subsets of $\Omega$ is called an algebra if it has the following properties:
(i) $\Omega \in \mathcal{A}$;
(ii) $A \in \mathcal{A}$ implies $A^{c} \in \mathcal{A}$;
(iii) $A_1, \dots, A_n \in \mathcal{A}$ implies $\bigcup_{i=1}^{n} A_i \in \mathcal{A}$.
Definition: A σ-algebra (or σ-field) $\mathcal{F}$ is a non-empty collection of subsets of $\Omega$ with the following properties:
(i) $\Omega \in \mathcal{F}$;
(ii) If $A \in \mathcal{F}$, then $A^{c} \in \mathcal{F}$;
(iii) If $A_i \in \mathcal{F}$, $i = 1, 2, \dots$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
Definition: A measurable space $(\Omega, \mathcal{F})$ is a pair consisting of a sample space $\Omega$ and a σ-algebra $\mathcal{F}$ of subsets of $\Omega$ (which is also known as the event space).

Definition: A probability space $(\Omega, \mathcal{F}, P)$ is a triple consisting of a sample space $\Omega$, a σ-field $\mathcal{F}$ of subsets of $\Omega$, and a probability measure $P$ defined on the σ-field. That is, $P$ assigns a real number to every member of $\mathcal{F}$ such that the following conditions are satisfied:
(i) $0 \le P(A) \le 1$ for every $A \in \mathcal{F}$;
(ii) $P(\Omega) = 1$;
(iii) if $A_1, A_2, \dots \in \mathcal{F}$ are disjoint, then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.

Lemma: If $A_n \downarrow \emptyset$ (the empty or null set), that is, $A_{n+1} \subseteq A_n$ and $\bigcap_{n=1}^{\infty} A_n = \emptyset$, then $\lim_{n \to \infty} P(A_n) = 0$.

Remark: This Lemma expresses continuity at $\emptyset$. Therefore, it is straightforward to derive the following two lemmas, continuity from below and continuity from above, by applying it to suitable differences of events.

Lemma: If $A_n \uparrow A$, then $\lim_{n \to \infty} P(A_n) = P(A)$.

Lemma: If $A_n \downarrow A$, then $\lim_{n \to \infty} P(A_n) = P(A)$.

Remark: To prove that $P$ is a probability measure, one must also verify the countable subadditivity (union bound) inequality $P\left(\bigcup_{i} A_i\right) \le \sum_{i} P(A_i)$.
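As a concrete illustration (a sketch added here, not part of the original notes), the finite probability space of a fair die can be written down explicitly, with the power set of $\Omega$ as the σ-algebra and the uniform measure as $P$; the assertions below check conditions (i)-(iii) numerically for a pair of disjoint events.

from itertools import combinations
from fractions import Fraction

# Sample space for one roll of a fair die.
omega = {1, 2, 3, 4, 5, 6}

def power_set(s):
    """All subsets of s: the largest sigma-algebra on a finite sample space."""
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

sigma_algebra = power_set(omega)

def P(event):
    """Uniform probability measure: P(A) = |A| / |Omega|."""
    return Fraction(len(event), len(omega))

# (i) 0 <= P(A) <= 1 for every event, (ii) P(Omega) = 1,
# (iii) additivity over disjoint events (checked here for one disjoint pair).
assert all(0 <= P(A) <= 1 for A in sigma_algebra)
assert P(frozenset(omega)) == 1
A, B = frozenset({1, 2}), frozenset({5, 6})
assert P(A | B) == P(A) + P(B)          # disjoint union
print(P(frozenset({2, 4, 6})))          # P(even) = 1/2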
Definition: A random variable (i.e., a measurable function) defined on $(\Omega, \mathcal{F})$ and taking values in a measurable space $(E, \mathcal{E})$ is a mapping (or function) $f: \Omega \to E$ with the property that if $B \in \mathcal{E}$, then $f^{-1}(B) = \{\omega : f(\omega) \in B\} \in \mathcal{F}$.

Remark: Two measurable spaces are involved here: $(\Omega, \mathcal{F})$ and $(E, \mathcal{E})$.
We may treat the first measurable space as an input space and the second one as an output space. A random variable is just a mapping whose inverse images of events in the output space are events in the original measurable space. If we consider $E = \mathbb{R}^n$ (Euclidean space), we are dealing with random vectors.

Definition: A collection of random variables $\{X_t, t \in T\}$, where $T$ is an index set, defined on a common probability space is called a random process.

Remark: If $T$ is countable, it is a discrete-time random process. Otherwise, it is a continuous-time random process.

1.2 Convergence of Random Variables and Limit Theorems

Before going into the details of mathematical statistics, we have to look into the convergence behavior of random variables.

Definition: A sequence of random variables $X_n$ converges in distribution to $X$, written $X_n \xrightarrow{d} X$, if $F_{X_n}(x) \to F_X(x)$ for every $x$ such that $F_X$ is continuous at $x$.

Definition: A sequence of random variables $X_n$ converges in probability to $X$, written $X_n \xrightarrow{P} X$, if $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$ for every $\epsilon > 0$.

Remark: The above definition says that the chance that $X_n$ and $X$ differ by any given amount is negligible when $n$ is large.

Theorem: If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{d} X$.

Corollary: $X_n \xrightarrow{d} c$ (a constant) implies $X_n \xrightarrow{P} c$.

Corollary: If $X_n \xrightarrow{P} a$ (a constant) and $g$ is continuous at $a$, then $g(X_n) \xrightarrow{P} g(a)$.

Corollary: If $X_n \xrightarrow{P} X$ and $g$ is continuous, then $g(X_n) \xrightarrow{P} g(X)$.
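To make the definition of convergence in probability concrete, here is a small Monte Carlo sketch (added for illustration; the coin-flip setting and the value of epsilon are arbitrary choices): for the sample mean of fair coin flips, the estimated probability of a deviation larger than epsilon from 1/2 shrinks as n grows.

import random

random.seed(0)
EPS = 0.05          # fixed deviation epsilon
TRIALS = 2000       # Monte Carlo repetitions per n

def deviation_prob(n):
    """Estimate P(|Xbar_n - 1/2| > EPS) for the mean of n fair coin flips."""
    bad = 0
    for _ in range(TRIALS):
        mean = sum(random.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) > EPS:
            bad += 1
    return bad / TRIALS

for n in (10, 100, 1000):
    print(n, deviation_prob(n))   # the estimated probability decreases with n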
Theorem: If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ (a constant), then
(a) $X_n + Y_n \xrightarrow{d} X + c$;
(b) $X_n Y_n \xrightarrow{d} cX$.

Proof: We prove (a); the proof of (b) is similar. Let $x$ be a point of continuity of $F_{X+c}$, that is, $x - c$ is a point of continuity of $F_X$. Because a distribution function has at most countably many points of discontinuity, for any $\epsilon > 0$ we can select $\delta > 0$ such that $x - c - \delta$ and $x - c + \delta$ are both points of continuity of $F_X$. Then
$P(X_n + Y_n \le x) \le P(X_n + Y_n \le x, |Y_n - c| \le \delta) + P(|Y_n - c| > \delta) \le P(X_n \le x - c + \delta) + P(|Y_n - c| > \delta).$
Because $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$,
$\limsup_{n} P(X_n + Y_n \le x) \le \lim_{n} F_{X_n}(x - c + \delta) = F_X(x - c + \delta).$
Similarly, $1 - P(X_n + Y_n \le x) = P(X_n + Y_n > x) \le P(X_n > x - c - \delta) + P(|Y_n - c| > \delta)$, so
$\liminf_{n} P(X_n + Y_n \le x) \ge \lim_{n} F_{X_n}(x - c - \delta) = F_X(x - c - \delta).$
Consequently, letting $\delta \to 0$ and using that $F_X$ is continuous at $x - c$, (a) follows. ¶

Definition: A sequence of random variables $X_n$ converges almost everywhere (or, almost surely, or with probability 1) to $X$, written $X_n \xrightarrow{a.s.} X$, if $P\left(\lim_{n \to \infty} X_n = X\right) = 1$.

Definition: A sequence of random variables $X_n$ converges in the $p$th mean (or, in $L^p$) to $X$ if $\lim_{n \to \infty} E|X_n - X|^p = 0$. In particular, if $p = 2$, $X_n$ converges in mean square to $X$, which is of interest in many engineering problems as $E|X_n - X|^2$ stands for the square of the Euclidean distance.

Theorem: If $X_n \xrightarrow{L^p} X$, then $X_n \xrightarrow{P} X$.
0, 1 and the event space 0,1 . We consider ,
,
1,2, … ,
,
1, and define random variables via the indicator function. We re‐arrange the random variables ,
,
, …. It is clearly 1
,
,
, … as , and also in mean square, but does not converge almost everywhere. Theorem: If . .
, then . Remark: The following figure similar to the well known Venn’s diagram may represent the relationship of these 4 kinds of convergence. Convergence in Probability
Convergence
almost
everywhere
Convergence
in Lp
Convergence in Distribution
Figure 1: Relationship of Convergence In the following, we are going to introduce different versions of Law of Large Numbers to study the convergence behaviors. Theorem (Bernoulli’s Weak Law of Large Numbers): If is a sequence of random variables such that ~
, 1, then 5 Probability Theory and Mathematical Statistics KC Chen Theorem (Chebychev’s Inequality): If is any random variable, then | |
Corollary: Let be a nonnegative function on the range of a random variable . Then, such that is non‐decreasing on Proof: 1
1
1Z
¶ Corollary (Markov Inequality): For a random variable ,
| |
. Corollary (Chernoff Bound): For a random variable and a constant , min
Remark: These inequalities are useful in information theory. The Chernoff bound usually offers a tighter bound than Chebychev inequality. Theorem (Weak Law of Large Numbers): Let be a sequence of independent and identically distributed (i.i.d.) random variables with mean and variance Let ∑
. be the sample mean. |
|
|
and |
0 as ∞. Theorem (Khintchin’s Weak Law of Large Numbers): Let ∑
i.i.d. random variables with mean ∞ and 1 be a sequence of . Then, Theorem (De Moivre‐Laplace): is a sequence of random variables such that , has a Bernoulli distribution , 0
1. Then, 6 KC Chen Probability Theory and Mathematical Statistics 0,1 1
Central Limit Theorem: Let and variance ∞. If be a sequence of i.i.d. random variables with mean ∑
, 0,1 √
Theorem (Kolmoorov’s Strong Law of Large Numbers): Let of i.i.d. random variables with mean ∞. Then, ∑
.
1 be a sequence 1.3 Mathematical Statistics To understand unknown behaviors, we must conduct some experiments to acquire useful information from observations (or data). A typical experiment is to sample. Given a random experiment with sample space Ω, we define a random vector , ,…,
. When denotes the outcome of the experiment, is referred as the observations or data. Since we only observe , we only care its probability distribution, which is assumed to be a member of a family of probability distributions on . Therefore, is known as the model. We usually describe by parameterization (but not necessary), that is, by a mapping from the parameter space (a space of label) to . In other words, : and models are called parametric if is a “nice” subject of such that is a smooth mapping. Definition: A function of one or more random variables that does not depend on any unknown parameter is called a statistic. Definition: Let , , … , denote a random sample of size from a given (known or unknown) distribution. The statistic 1
is called the sample mean and the statistic 7 Probability Theory and Mathematical Statistics KC Chen is called the sample variance. Remark: If we inappropriately design an experiment, we may induce errors in sampling, in addition to observation errors. However, please keep in mind that the value of statistics lays in deriving useful information or conclusion from data in presence of uncertainty. If this reasoning is based on random variables, it is called statistical inference. The following example gives the most straightforward construction of statistical inference. Example (Regression): We observe pairs of random variables , ,…,
,
. is a ‐dimensional vector representing various characteristics of the th subject in the experiment or study, such as height, weight, age, etc. The distribution of response for the th subject is postulated to depend on characteristics . In general, is a nonrandom vector called covariate vector or a vector of explanatory variables. is random and known as response variable or dependent variable as its distribution depending on . If | denotes the density of for a subject with covariate vector , then the model is ,…,
|
to of interests, to denote the Let be an unknown function from expectation of a response with a given covariate vector . We can then write ,
1, … , where ,
1, … , . As a special case, by defining and identically distributed , a linear regression model thus becomes ,…,
,1
Please note that we usually conduct mean‐squared error criterion in EE&CS, and same to derive linear regression using . Lemma: There are some useful properties of conditional probability for statistical inference, where stands for a hypothesis. (Product or Chain Rule) , |
| ,
|
| ,
| ∑
∑
(Sum Rule) |
, |
| ,
| 8 KC Chen (Bayes) | ,
| ,
|
|
∑
Probability Theory and Mathematical Statistics | ,
| ,
|
|
(Independence) ,
Actually, probability can be used to model more general problems including human knowledge systems (for belief, cognition, trust, etc.), which involves reasoning in the presence of uncertainty. It can be linked by the famous Cox Axioms. Cox Axioms: Let “the degree of belief in proposition ” be denoted by . The negation of is . | means that the degree of belief in a conditional proposition assuming proposition to be true. (a) Degree of belief can be ordered. That is, if and , then . (b) The degree of belief in a proposition and its negation are related. There is a function such that . (c) The degree of belief in a conjunction of propositions , (
) is related to the degree of belief in the conditional proposition | and the degree of belief in proposition . There exists a function such that ,
| ,
. If above 3 conditions are satisfied, we can model the belief as a probability. Remark: Belief can be mapped into real numbers. Derivations of probability or statistics often fall into two scenarios: forward and inverse. Forward problems usually involve a generative model describing a process to deliver data. Example: An urn contains balls, among which B are blue and are white. Danny randomly draws a ball from the urn and replaces it for times. (a) What is the probability distribution of the number of times a blue ball is drawn, ? (b) What is the expectation of ? What is the variance of ? Answer: Define var
/
1
. | ,
1
and thus . ¶ Inverse probability problems again involve a generative model. However, instead of computing probability distribution of certain quantity induced by the model, we 9 Probability Theory and Mathematical Statistics KC Chen derive the conditional probability of unobserved variable(s) in the model, given observed variable(s). Bayes theorem this plays a dominating role. Lemma (Bayes’ Theorem): Suppose , is a random vector with density function ,
. The conditional density function of given |
is 0. Similarly, is the conditional density of given |
|
|
,
if . Example: There are 11 urns labeled by 0, 1, 2, …, 10, and each urn contains 10 balls. Urn contains blue balls and 10
white balls. Danny randomly selects an urn and draws times with replacement, to obtain blue balls and white balls. Chloé examines the results without knowing the selected urn. In 10 draws, 3 blue balls have been drawn. For Chloé, what is the probability that Danny is using urn ? Answer: The conditional probability of given |
,
We may denote | ,
|
is 1
|
1
11
10
1
10
. ¶ Remark: Above inverse probability problem is useful to illustrate some terminologies in statistical inference or statistics. The marginal probability is called the prior probability of . | , is called the likelihood. It is vital to note the difference between probability and likelihood. | , is a function of both and . For fixed , | , defines a probability over . For fixed , | , defines the likelihood of . The conditional probability | , is called the posterior probability of given . | is a constant independent of and is known as the evidence or marginal likelihood. Proposition: If denotes the unknown parameter, denotes the data, and denotes the overall hypothesis space, then we have | ,
| ,
|
|
Remark: In high‐level language, above proposition means 10 KC Chen Probability Theory and Mathematical Statistics likelihood prior
evidence
posterior
Example (continue): Chloé observes Danny’s drawing 3 blue balls in 10 draws. Supposing that Danny draws another ball from the same urn, what is the probability that next drawn ball is blue? Please note that this is so‐called prediction. Answer: Let represents the event of ball 1 being blue. |
,
| ,
,
|
,
|
,
|
3,
0 1 0 0.063 2 3 4 5 6 7 8 9 10 0.22 0.29 0.24 0.13 0.047 0.099 0.00086 0.0000096 0 Table: Conditional probability of given 3 and 10 ball 1 being blue|
3 and 10 =0.333 ¶ Example (Laplace Formula): Suppose we have a bent coin with uneven probabilities for head and tail in a series of independent tossing as a series of Bernoulli trials. For the first trials, there are times to have head and times for tail. The famous Laplace Formula tells that the probability of next independent tossing to be head is 1
2
1
2
Proposition (Likelihood Principle): Given a generative model for data given parameter , | , and having observed a particular outcome , all inference and predictions should depend only on the function | . 11 Probability Theory and Mathematical Statistics KC Chen Remark: Although likelihood principle is simple, many classical statistical methods violate it. 1.4 Information and Entropy As a matter of fact, source coding (e.g. data compression) has close relationship with data modeling and inverse probability problem. To deal with information transmission, we may consider an ensemble as a triple ,
,
, where denotes outcomes from a random variable; denotes possible values (that is, alphabets in information transmission or information theory); denotes probabilities (or possibilities). Definition: Shannon’s information content of an outcome 1
log
Definition: Entropy of an ensemble is log
Lemma: 0 with equality if an only if 1 for one . is uniformly distributed. Then, Lemma: Entropy is maximized if log |
| with equality if and only if |
|
. Corollary: The joint entropy of and is ,
,
log
,
,
Definition: The Relative Entropy or Kullback‐Leibler Divergence between two probability distributions ,
over the same is ||
Corollary (Gibb’s Inequality): Remark: ||
||
12 ||
log
0 in general, and thus is not a distance in KC Chen Probability Theory and Mathematical Statistics general. In the following, we are going to introduce the well known and useful inequality for convex functions. Definition: A function is convex up over , if every chord of the function lies above the function. In other words, ,
, and 0
1, A function 0 and 1
is strictly convex up if 1
,
,
. The equality holds only for 1. Lemma (Jensen’s Inequality): If is a convex up function, and is a random variable, then Remark: Jensen’s inequality can be rewritten for a convex (or concave) down function by reversing the inequality. A physical meaning of Jensen’s inequality: In case masses are placed on a convex up curve , i.e. at locations ,
, the center of gravity of these masses lies above the curve. Exercises: 1. The outer‐measure of an interval , is defined to be . Let denote the set of all rational numbers within 0, 1 and denote the set of all 2.
irrational numbers. Please find the outer‐measure for and . The condition (iii) in the definition of σ‐algebra, it is equivalent to that If 1, 2, … ,
. 3.
Suppose takes values only on 4.
Suppose , are continuous and 5.
Please find another example that 6.
(Berry‐Esséen Theorem) Let mean and variance sup |
7.
Suppose , ,…,
distribution. Let and . Then, . Then, does not imply . .
. . . be a sequence of i.i.d. random variables with . Then, 33 |
|
4
√
√
, … are independently sampled from ,
max
, ,…,
. (a) Is a monotonically |
13 Probability Theory and Mathematical Statistics 8.
KC Chen increasing function of ? (b) For 0,1 , 3 of ,…,
are larger than . Is it likely? Let be a continuous random variable such that 0
0 and 0 0. Let , , … denote independent and identically distributed (i.i.d.) random variables of probability density function (PDF) . Suppose min
,…,
. Please show that 0. 9. Please derive the Laplace formula in Section 1.3. 10. ABC company manufactures computers for XYZ brand and of them are defective. The quality control team in XYZ samples computers to check and finds defectives. For max 0,
min ,
, find the distribution for each , which is known as hyper‐geometric , , . 11. The entropy of a random variable is defined as log
(a) Please find an example of that 0. (b) The information divergence between two probability distributions and on a common alphabet X is defined as ||
log
Please prove that ||
0, which is know as Gibbs’ inequality. Hint: 0,
1, with equality if and only if 1. (c) Please prove that for any random variable , log |X| Where |X | denotes the size of alphabet X. Hint: (b) is useful. This inequality leads to the famous Fano inequality. (d) Please find the condition and thus an example to tight the upper bound in (c). 12. A source randomly produces a character from the alphabet 0,1, . . ,9, a, b, … , z . is a numeral (i.e. from 0,1, . . ,9 equally probable with total probability 1/3; is a vowel (i.e. from a, e, i, o, u ) equally probable with total probability 1/3; is one of the 21 consonants equally probable with total probability 1/3. Please find the “quickest” estimate of the number of bits to represent in binary bits. Hint: Please consider the decomposability of the entropy. 14