Basics of Probability Theory for Ph.D. Students in Education, Social
Sciences and Business
(Shing On LEUNG and Hui Ping WU)
(April, May 2015)
This is a series of three talks, respectively on:
A. Probability Theory
B. Hypothesis Testing
C. Bayesian Inference
Lecture 1: Probability Theory
A. Probability Theory
 Why probability theory?
 Model vs. data: a model is not real, but serves as a framework for the analysis and interpretation of real data (e.g., the Normal distribution)
 A probability model is a model, not reality, and a basic framework for our uncertain future world
 For example, you conduct research, take a sample, and ask how the data collected from your sample can be generalized to a wider population, etc. The basic mechanism behind this is probability theory.
 Probability theory is the basis for many other topics, e.g., Hypothesis Testing and Bayesian Analysis, and possibly many others.
Probability Theory
Axioms of probability
 Axiom 1: every probability ≥ 0
 Axiom 2: the probabilities of all possible events sum to 1 (Axioms 1 and 2 already imply that every probability ≤ 1)
 Axiom 3: Pr(A or B) = Pr(A) + Pr(B) if A and B do not overlap
 These are axioms, not assumptions, i.e., we start from them. They are not assumptions to be challenged; they are the starting point.
 If we do not agree with them, we cannot move forward.
Independence
 Pr(A & B) = Pr(A) × Pr(B): the chance of A does not affect the chance of B.
 Say, the presence of one student (or subject) does not affect the presence of the others (independent sampling).
 Say, a higher IQ score does not lead to a higher GPA (independence). Sometimes we want to assume independence in the hope that it is rejected, so that we can then establish a relation between the two (IQ and GPA in this case). This is the logic of Hypothesis Testing.
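A minimal simulation sketch (two independent events with made-up probabilities) showing that the joint frequency approaches Pr(A) × Pr(B) under independence:

```python
import random

random.seed(0)
n = 100_000
p_a, p_b = 0.3, 0.6          # illustrative probabilities, not from the talk

both = 0
for _ in range(n):
    a = random.random() < p_a    # event A occurs independently of B
    b = random.random() < p_b
    if a and b:
        both += 1

print(both / n)   # close to 0.18 = 0.3 * 0.6
```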
Conditional probabilities
 Pr(A|B): the probability of A given that B occurs
 Pr(A|B) = Pr(A) if A and B are independent: the presence of B (say, one student) does not affect the chance of A (another student) (independence in sampling).
 Say, a higher IQ gives a higher chance of a higher GPA, i.e., Pr(A|B) > Pr(A) (dependence). Note, this concerns the chance, not the actual score, i.e., it is wrong to say that a higher IQ will give a higher GPA. Rather, a higher IQ leads to a higher chance of getting a higher GPA.
 If Pr(A|B) is not equal to Pr(A), then A and B are dependent.
 Conditional probability is the basis for Bayesian Analysis, among other topics.
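A small sketch (joint probabilities invented for illustration) computing a conditional probability from the definition Pr(A|B) = Pr(A and B) / Pr(B):

```python
# Illustrative probabilities: A = "high GPA", B = "high IQ".
pr_a_and_b = 0.30   # Pr(high IQ and high GPA)
pr_b = 0.40         # Pr(high IQ)
pr_a = 0.50         # Pr(high GPA) overall

pr_a_given_b = pr_a_and_b / pr_b
print(pr_a_given_b)   # 0.75 > Pr(A) = 0.5, so A and B are dependent
```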
Monty Hall Problem
You are asked to choose 1 out of 3 balls (say A, B and C). After you have chosen A, you are told that B is not correct. Would you (i) keep A, or (ii) change to C?
http://en.wikipedia.org/wiki/Monty_Hall_problem
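A minimal simulation sketch of the problem (assuming the standard rule that the revealed option is always an incorrect one you did not pick), comparing the two strategies:

```python
import random

random.seed(1)
n = 100_000
stay_wins = switch_wins = 0

for _ in range(n):
    options = ["A", "B", "C"]
    correct = random.choice(options)
    picked = "A"                       # you always pick A first
    # An incorrect option that is not your pick is revealed.
    revealed = random.choice([o for o in options
                              if o != picked and o != correct])
    switched = next(o for o in options if o != picked and o != revealed)
    stay_wins += (picked == correct)
    switch_wins += (switched == correct)

print(stay_wins / n)     # about 1/3
print(switch_wins / n)   # about 2/3: switching doubles the chance
```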
Random variable
 A variable that is random (what does that mean?)
Variable
 A variable can take different values, e.g., sex is a variable, male and female are its labels; group is a variable, experimental or control group is a value; IQ score is a variable, IQ = 50 is an actual value.
 Measurement is the measurement of variation
Random
 A random variable is a variable that takes values according to a specified probability model, or, simply, that takes values by some chance.
 Say, Pr(female) = 0.6, Pr(male) = 0.4, etc. The chance of getting a boy at UM may be only 0.4. So, sex at UM is a random variable.
So both words, "random" and "variable", carry their meanings.
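A minimal sketch (using the 0.6/0.4 split from the slide) drawing values of such a random variable:

```python
import random

random.seed(2)
# Sex as a random variable: values drawn with specified probabilities.
sample = random.choices(["female", "male"], weights=[0.6, 0.4], k=10_000)
print(sample.count("female") / len(sample))   # close to 0.6
```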
Probability Distributions (models)
 A probability distribution is one that follows the 3 axioms.
 "A" distribution implies it is 1 out of the many distributions available.
 Any distribution that follows the 3 axioms is a probability distribution.
 Discrete vs. continuous. Discrete distributions: probabilities are represented by single points; continuous distributions: probabilities are represented by areas.
 Probability Density Function (pdf) and Cumulative Distribution Function (cdf)
 pdf: plots the chance of a random variable against its value
 cdf: plots the total chance that the random variable is less than a given value
 both pdf and cdf are useful in different situations
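A minimal sketch (using scipy's standard Normal as an example) evaluating a pdf and a cdf:

```python
from scipy.stats import norm

# Standard Normal: mean 0, SD 1.
print(norm.pdf(0))      # density at x = 0, about 0.3989
print(norm.cdf(0))      # Pr(X < 0) = 0.5
print(norm.cdf(1.96))   # about 0.975: Pr(X < 1.96)
```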
Examples of Probability Distributions
Discrete
 Uniform: e.g., n = 5 equally likely values, where n = b − a + 1
 Binomial (n = 10, p = 0.5)
 Binomial (n = 10, p = 0.1)
Continuous
 Uniform
 Normal
 Chi-squared, where k is the parameter, known as df (degrees of freedom)
 Gamma, where k and θ are the parameters, denoting the shape and scale
Important note: knowledge of "common" probability distributions is necessary for "non-common" ones, e.g., the Normal before fat-tailed distributions.
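A minimal sketch (using scipy) reproducing the two Binomial examples above numerically:

```python
from scipy.stats import binom

# Binomial pmf at each possible count, for the two parameter settings above.
for p in (0.5, 0.1):
    probs = [binom.pmf(k, 10, p) for k in range(11)]
    print(p, [round(q, 3) for q in probs])
# With p = 0.5 the pmf is symmetric around 5; with p = 0.1 it piles up near 0.
```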
Relationships among probability distributions
http://en.wikipedia.org/wiki/Relationships_among_probability_distributions
Knowledge of one distribution facilitates knowledge of the others.
Expectation
E(X) = ∫ x p(x) dx, the "expected" value of X.
 "Expected" value implies the value we get (i) in the long run, and (ii) on average
 Or, the value we get by taking all possibilities (defined by the pdf) into consideration
 Or, "automatic averaging" by the pdf
 Expectation is needed for moments (described later), which in turn characterize a distribution, and hence all are linked.
The expectation can be taken of f(X) (a function of X) instead of only X, i.e.,
E( f(X) ) = ∫ f(x) p(x) dx
That is, the expected value of f(X) instead of just X.
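A minimal sketch approximating E(X) and E(f(X)) by numerical integration (the standard Normal is used as an illustrative pdf):

```python
import numpy as np
from scipy.stats import norm

# Riemann-sum approximation of the integrals on a fine grid.
x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]
pdf = norm.pdf(x)

e_x = np.sum(x * pdf) * dx       # E(X) = ∫ x p(x) dx, about 0 here
e_fx = np.sum(x**2 * pdf) * dx   # E(f(X)) with f(x) = x², about 1 here
print(round(e_x, 6), round(e_fx, 6))
```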
Moments
 How do we describe a distribution?
 Graphically, by the distribution, pdf, or cdf
 Or, by some "summary statistics": mean, SD, skewness, and kurtosis
Mean: location
SD or variance: dispersion
Skewness: asymmetry (degree of departure from a symmetric distribution)
Kurtosis
Below is copied from (on 13 Feb. 2015):
http://en.wikipedia.org/wiki/Kurtosis
 Below, each distribution has (i) mean = 0, (ii) SD = 1, and (iii) skewness = 0, but a different kurtosis.
 D: Laplace distribution, also known as the double exponential
distribution, red curve (two straight lines in the log-scale plot), excess
kurtosis = 3
 S: hyperbolic secant distribution, orange curve, excess kurtosis = 2
 L: logistic distribution, green curve, excess kurtosis = 1.2
 N: normal distribution, black curve (inverted parabola in the log-scale
plot), excess kurtosis = 0
 C: raised cosine distribution, cyan curve, excess kurtosis =
−0.593762...
 W: Wigner semicircle distribution, blue curve, excess kurtosis = −1
 U: uniform distribution, magenta curve (shown for clarity as a
rectangle in both images), excess kurtosis = −1.2.
 Important note: all have the same SD and variance. The SD alone just cannot tell us about the peak and the tails!
Moments (again)
 It so happens that the mean, SD, skewness and kurtosis correspond respectively to the first 4 moments of a distribution.
 The definitions of the moments are as follows:
E(X) = ∫ x p(x) dx, for r = 1
E( (X − µ)^r ) = ∫ (x − µ)^r p(x) dx, for r > 1
We take (x − µ) instead of x, i.e., moments about the mean µ (= E(X)).
Implications of moments
 If we do not know the distribution, but we know its first 4 moments (from our data, say), we can accurately (though not exactly) calculate the tails (or percentiles, areas, etc.) of that distribution
 Or, if two distributions have similar first 4 moments, they are similar
 Or, if two have similar first 3 moments, they are "quite" similar
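A minimal sketch computing the four summary quantities from a data sample (scipy conventions: `skew` and `kurtosis` return the standardized third and fourth moments, with `kurtosis` giving excess kurtosis by default):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(3)
data = rng.normal(loc=50, scale=10, size=100_000)   # illustrative sample

print(np.mean(data))   # about 50 (1st moment: location)
print(np.std(data))    # about 10 (from the 2nd central moment: dispersion)
print(skew(data))      # about 0  (3rd standardized moment: asymmetry)
print(kurtosis(data))  # about 0  (excess kurtosis is 0 for a Normal)
```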
Linear transformation
y = a + bx
By a linear transformation, we can always equate two distributions on the mean and SD, i.e., the first 2 moments.
Log transformation
y = log(x)
transforms the range of x, (0, ∞), to the range of y, (−∞, +∞)
Logit transformation
y = logit(x) = log( x / (1 − x) )
transforms x from the range (0, 1) to y in (−∞, +∞)
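A minimal sketch of the three transformations (only Python's standard math module is needed):

```python
import math

def logit(x):
    # log-odds: maps (0, 1) onto the whole real line
    return math.log(x / (1 - x))

a, b = 10, 2
x = 0.25
print(a + b * x)     # linear: shifts and rescales (changes mean and SD)
print(math.log(x))   # log: (0, ∞) -> (-∞, +∞); negative for x < 1
print(logit(x))      # logit: (0, 1) -> (-∞, +∞); about -1.0986
```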
Univariate and Multivariate Distributions
 Univariate: one dimension
 Multivariate: several dimensions
 Multiple ("多元") regression: one y (dependent variable) and many x's (independent variables)
 Multivariate ("多維") regression: many y's
Correlation
 Correlation: a relation among variables
 Population correlation: within a probability model
 Empirical correlation: within a real data set
Conditional Probability and Conditional Probability Distributions
Conditional probability:
Pr(A|B) = Pr(A and B) / Pr(B)
Conditional probability distribution:
p(y|x) = p(y and x) / p(x)
Now x and y are random variables following specified probability distributions.
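A minimal sketch (joint probabilities invented for illustration) turning a small joint distribution p(x, y) into the conditional distribution p(y|x):

```python
# Joint pmf over two binary random variables (illustrative numbers).
joint = {
    ("x=0", "y=0"): 0.30, ("x=0", "y=1"): 0.20,
    ("x=1", "y=0"): 0.10, ("x=1", "y=1"): 0.40,
}

# Marginal p(x=1), then conditional p(y | x=1) = p(x=1 and y) / p(x=1).
p_x1 = sum(p for (x, _), p in joint.items() if x == "x=1")
p_y_given_x1 = {y: joint[("x=1", y)] / p_x1 for y in ("y=0", "y=1")}
print(p_y_given_x1)   # {'y=0': 0.2, 'y=1': 0.8}
```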
Estimation
 There can be "parameters" within "a class" of models, e.g., the Normal with mean μ and variance σ².
 To specify a model, each parameter takes a specific value, e.g., μ = 50, σ = 10.
 Parameter estimation refers to estimating specific values of the parameters within a class of distributions, e.g., estimating μ and σ within the Normal.
 3 questions: (i) on what basis? (ii) by what criteria? (iii) by what method?
 (i) based on our data
 (ii) commonly used criteria: maximum likelihood, least squares, etc.
 (iii) method: the way to do it (leave it to the statisticians)
 MLE (maximum likelihood estimation) refers to finding the parameters that maximize the chance of getting the data, i.e., that maximize the pdf. It is one of the popular methods for finding parameters.
Let X = data and θ = parameters. Then
Pr(X|θ) (pdf of X given θ) = L(θ|X) (likelihood of θ given X)
 Mathematically, both sides are the same, but
 Pr(X|θ) treats X as the variable and θ as fixed (the chance of X given θ)
 L(θ|X) treats θ as the variable and X as fixed (what values of θ maximize L, given that we got X, our data?)
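A minimal MLE sketch for a Normal sample, maximizing L(θ|X) numerically and checking it against the known closed form (for a Normal, the MLEs of μ and σ are the sample mean and the uncorrected sample SD):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(4)
data = rng.normal(loc=50, scale=10, size=1_000)   # pretend this is our data X

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf   # keep the search in the valid region
    # L(θ|X): the same pdf, now viewed as a function of θ with X fixed.
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[40.0, 5.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(mu_hat, sigma_hat)             # close to the true μ = 50, σ = 10
print(np.mean(data), np.std(data))   # the Normal MLE in closed form
```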
Some of the above details are quite complicated, but you only need to know them conceptually, not technically. Nowadays, computer packages will do this for you, but you need to understand the background for interpretation purposes.
Concluding remarks
 A probability model is a model. It is not real, but has practical implications.
 These contents are basic for learning, and understanding, other topics (even if you are doing qualitative research)
 They give a framework for understanding our uncertain world
Next lecture: On Hypothesis Testing (based on knowledge of
probability theory)
Q&A
Shing On LEUNG
[email protected]
Hui Ping WU
[email protected]