Random Variables and Distributions
COMP5318 Knowledge Discovery and Data Mining
Examples
• We have heard of statements like “Height
is Normally Distributed”
[Figure: Normal curve with its mean and standard deviation marked]
Why distributions are important
• Distributions capture the essence of data
associated with a particular variable(s) (e.g.,
height).
• If we know height is Normally distributed then a
small random sample is enough to provide a
very good idea about the general population.
• Can answer questions like: what is the
probability of finding a 2 meter tall Australian?
• Need to understand the concept of random
variable.
Random Variable
• Let S be the sample space.
• A random variable X is a function
X: S → ℝ
• Suppose we toss a coin twice. Let X be the
random variable "number of heads".
Random Variable
(Number of Heads in two coin tosses)
S     X
TT    0
TH    1
HT    1
HH    2
We also associate a probability with X attaining that value.
Random Variable
(Number of Heads in two coin tosses)
S     Prob   X
TT    1/4    0
TH    1/4    1
HT    1/4    1
HH    1/4    2

X     P(X=x)
0     1/4
1     1/2
2     1/4
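The table can be rebuilt by enumerating the sample space. The sketch below is in Python rather than the Matlab used elsewhere in these slides, and it assumes a fair coin, which is what the 1/4 entries imply. The names `sample_space`, `X` and `pmf` are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space S for two tosses: TT, TH, HT, HH (each equally likely).
sample_space = ["".join(t) for t in product("TH", repeat=2)]

# The random variable X maps each outcome to its number of heads.
def X(outcome):
    return outcome.count("H")

# P(X = x) = (number of outcomes with X = x) / |S|.
counts = Counter(X(s) for s in sample_space)
pmf = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```

The printed pairs match the X, P(X=x) table above, and the probabilities sum to 1.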
Random Variables follow a
Distribution
• The height of Australian soldiers is a random variable which follows
a Normal distribution with mean 180 cm and standard deviation 15
cm.
• The frequency of words in a text is a random variable which
follows a Zipf distribution.
• The speed of a hurricane is a random variable which follows a
Cauchy distribution.
• The number of car accidents in a fixed time duration is a random
variable which follows a Poisson distribution.
• The number of heads in a sequence of coin tosses is a random
variable which follows a Binomial distribution.
• The number of web hits in a given time period is a r.v. which
follows a Pareto distribution.
• Many times we don’t know what named distribution a r.v. follows or
whether it follows any named distribution at all!
Distribution Definitions
• Discrete Probability Distribution
• Continuous Probability Distribution
• Cumulative Distribution Function
Discrete Distribution
• A r.v. X is discrete if it takes countably many
values {x1, x2, …}
• The probability function or probability mass
function for X is given by
– fX(x)= P(X=x)
• From previous example
\[
f_X(x) =
\begin{cases}
1/4 & x = 0 \\
1/2 & x = 1 \\
1/4 & x = 2 \\
0 & \text{otherwise}
\end{cases}
\]
Continuous Distributions
• A r.v. X is continuous if there exists a function fX
such that
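\[
f_X(x) \ge 0 \ \text{for all } x, \qquad
\int_{-\infty}^{\infty} f_X(x)\,dx = 1, \qquad
P(a \le X \le b) = \int_a^b f_X(x)\,dx \ \text{for every } a \le b.
\]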
Example: Continuous Distribution
• Suppose X has the pdf
\[
f_X(x) =
\begin{cases}
1 & 0 \le x \le 1 \\
0 & \text{otherwise}
\end{cases}
\]
• This is the Uniform (0,1) distribution
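As a quick numerical check of the two defining properties of a pdf, the Python sketch below integrates the Uniform (0,1) density with a simple midpoint rule. The helper `integrate` and the interval [0.2, 0.5] are illustrative choices, not part of the slides.

```python
def uniform_pdf(x):
    """pdf of Uniform(0, 1): 1 on [0, 1], 0 elsewhere."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total = integrate(uniform_pdf, -1.0, 2.0)      # covers the whole support
p_interval = integrate(uniform_pdf, 0.2, 0.5)  # P(0.2 <= X <= 0.5)
print(round(total, 3), round(p_interval, 3))
```

The total area is approximately 1, and for the uniform density the probability of an interval inside [0, 1] is just its length, b - a.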
Binomial Distribution
• A coin flips Heads with probability p. Flip it n
times and let X be the number of Heads.
Assume flips are independent.
• Let f(x) =P(X=x), then
\[
f(x) =
\begin{cases}
\binom{n}{x}\, p^{x} (1-p)^{n-x} & x = 0, 1, \ldots, n \\
0 & \text{otherwise}
\end{cases}
\]
Binomial Example
• Let p = 0.5 and n = 5; then
\[
P(X = 4) = \binom{5}{4}\, 0.5^{4}\, (1 - 0.5)^{5-4} = 0.15625
\]
• In Matlab
>> binopdf(4,5,0.5)
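For readers without the Statistics toolbox, the same number can be obtained directly from the formula. This is a minimal Python sketch; `binomial_pmf` is a hypothetical helper name, not a library function.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p); 0 outside x = 0, 1, ..., n."""
    if not (isinstance(x, int) and 0 <= x <= n):
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)

print(binomial_pmf(4, 5, 0.5))  # 5 * 0.5**4 * 0.5 = 0.15625
```

Summing `binomial_pmf(x, 5, 0.5)` over x = 0, ..., 5 gives 1, as required of a probability mass function.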
Normal Distribution
• X has a Normal (Gaussian) distribution with
parameters μ and σ if
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^{2}} (x-\mu)^{2} \right\}
\]
• X is standard Normal if μ =0 and σ =1. It is
denoted as Z.
• If X ~ N(μ, σ²), then (X − μ)/σ ~ Z
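Standardisation means any Normal probability can be read off the standard Normal. The Python sketch below checks this with the standard-library `statistics.NormalDist`, reusing the soldier-height parameters from an earlier slide; the query value x = 200 is an arbitrary illustration.

```python
from statistics import NormalDist

mu, sigma = 180.0, 15.0   # soldier-height example: mean 180 cm, sd 15 cm
X = NormalDist(mu, sigma)
Z = NormalDist()          # standard Normal: mean 0, sd 1

x = 200.0                 # arbitrary query point
z = (x - mu) / sigma      # standardised value
print(X.cdf(x), Z.cdf(z)) # P(X <= x) and P(Z <= z) agree
```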
Normal Example
• The number of spam emails received by an email server in
a day follows a Normal distribution with mean 1000 and standard
deviation 500, i.e. N(1000, 500²). What is
the probability of receiving 2000 spam emails in a day?
• Let X be the number of spam emails
received in a day. We want P(X = 2000).
• The answer is P(X = 2000) = 0, because a continuous r.v. takes
any single value with probability zero.
• It is more meaningful to ask for P(X ≥ 2000).
Normal Example
• This is
\[
P(X \ge 2000) = 1 - P(X < 2000) = 1 - \int_{-\infty}^{2000} f(x)\,dx
\]
• In Matlab: >> 1 - normcdf(2000,1000,500)
• The answer is 1 – 0.9772 = 0.0228 or 2.28%
• This type of analysis is so common that there is a
special name for it: cumulative distribution function F.
\[
F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt
\]
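The spam calculation can also be reproduced without Matlab. A minimal Python sketch, assuming (as the `normcdf` call does) that 500 is the standard deviation:

```python
from statistics import NormalDist

spam = NormalDist(mu=1000, sigma=500)  # sigma is the standard deviation
p_at_least_2000 = 1 - spam.cdf(2000)   # 1 - F(2000)
print(round(p_at_least_2000, 4))       # 0.0228
```

Here `cdf` plays the role of F: it returns P(X ≤ x), so the upper-tail probability is one minus that value.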
Outliers
• In data mining we are often interested in
outliers
– especially in high-dimensional data, which we
cannot easily visualize
• Knowledge of distributions can be very
useful in this context.
• Let's see how.
Outliers in Normal Distribution
• Conventionally, a value is considered
an outlier if it is at least three standard
deviations away from the mean.
• Let's assume we have a standard Normal
distribution: N(0,1)
• We want P(X < -3) + P(X > 3)
• = normcdf(-3,0,1) + 1 - normcdf(3,0,1) = 0.0027
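The same two-tail probability in Python, as a small check of the 0.0027 figure:

```python
from statistics import NormalDist

Z = NormalDist()                          # standard Normal N(0, 1)
p_outlier = Z.cdf(-3) + (1 - Z.cdf(3))    # P(Z < -3) + P(Z > 3)
print(round(p_outlier, 4))                # 0.0027
```

So under a Normal model only about 27 values in 10,000 fall more than three standard deviations from the mean.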
Outliers using Univariate Normal
Distribution
• Typically we are given data and we want to find
outliers in the data, if any.
• Here are the steps:
1. Make the assumption that the data come from
a Normal distribution.
2. Estimate the parameters of the Normal
distribution.
3. Find all data points which are more than three
standard deviations away from the mean.
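The three steps can be sketched in a few lines of Python. The height values below are made up for illustration, and the sample mean and sample standard deviation are used as the parameter estimates, which is one common choice rather than the only one.

```python
from statistics import mean, stdev

# Hypothetical heights in cm; the last value is deliberately extreme.
heights_cm = [168, 172, 175, 169, 171, 174, 170, 173, 176, 167,
              172, 170, 169, 174, 171, 173, 168, 175, 170, 250]

# Steps 1 and 2: assume Normality and estimate its parameters.
mu_hat = mean(heights_cm)
sigma_hat = stdev(heights_cm)   # sample standard deviation (divides by n - 1)

# Step 3: keep the points more than three standard deviations from the mean.
outliers = [x for x in heights_cm if abs(x - mu_hat) > 3 * sigma_hat]
print(outliers)  # [250]
```

Note that the extreme value itself inflates the estimated standard deviation, so with very small samples this rule may flag nothing at all.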
Outliers in Multidimensional Data
• Recall, in the Iris data, we have four attributes
and one class label.
• This is an example of a multidimensional data set.
• Look at the exponent of the Normal distribution:
\[
\left( \frac{x-\mu}{\sigma} \right)^{2} = (x-\mu)\,(\sigma^{2})^{-1}\,(x-\mu)
\]
• This is the square of the distance from a point x
to the mean μ in units of standard deviation σ
Outliers in Multidimensional Data
• In multidimensional data this can be
generalized to:
\[
(x-\mu)\,\Sigma^{-1}\,(x-\mu)'
\]
• This is called the Mahalanobis Distance
(squared)
• Σ is a d × d matrix called the variance-covariance
matrix
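To see what Σ⁻¹ does, here is a small Python sketch for d = 2, where the inverse can be written out by hand. The mean, covariance matrix and test points are hypothetical values chosen to make the arithmetic easy to follow.

```python
def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu) Sigma^{-1} (x - mu)' for d = 2."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of a 2 x 2 matrix
    dx = [x[0] - mu[0], x[1] - mu[1]]
    return sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))

mu = [0.0, 0.0]
cov = [[4.0, 0.0], [0.0, 1.0]]  # hypothetical: sd 2 along x, sd 1 along y

print(mahalanobis_sq([2.0, 0.0], mu, cov))  # one sd along x  -> 1.0
print(mahalanobis_sq([0.0, 2.0], mu, cov))  # two sd along y  -> 4.0
```

Both points are at Euclidean distance 2 from the mean, but the second is further away in Mahalanobis terms because the data vary less along y.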
Variance-Covariance Matrix
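If the data set is an N × d matrix, then Σ is the d × d matrix
\[
\Sigma =
\begin{pmatrix}
\sigma_{11} & \cdots & \sigma_{1d} \\
\sigma_{21} & \cdots & \sigma_{2d} \\
\vdots & & \vdots \\
\sigma_{d1} & \cdots & \sigma_{dd}
\end{pmatrix}
\]
where σ_ij is the covariance between attributes i and j, so σ_ij = σ_ji and σ_ii is the variance of attribute i.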
In Matlab
• Suppose we generate random 100x5 data:
>> data = rand(100,5);
• The covariance matrix is
>> cv = cov(data)
    0.0998   -0.0022    0.0006   -0.0080   -0.0025
   -0.0022    0.0933   -0.0051   -0.0100   -0.0010
    0.0006   -0.0051    0.0810   -0.0085    0.0083
   -0.0080   -0.0100   -0.0085    0.0820    0.0071
   -0.0025   -0.0010    0.0083    0.0071    0.0859
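A rough Python counterpart of the two Matlab lines, written without any toolbox so the computation is visible. `covariance_matrix` is a hypothetical helper; it divides by N - 1, as Matlab's `cov` does by default. The random numbers will differ from the Matlab run, so only the pattern is comparable.

```python
import random

def covariance_matrix(rows):
    """Sample covariance (divides by N - 1) of an N x d list of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

random.seed(0)  # arbitrary seed so the run is repeatable
data = [[random.random() for _ in range(5)] for _ in range(100)]
cv = covariance_matrix(data)

for row in cv:
    print(" ".join(f"{v:8.4f}" for v in row))
```

As in the Matlab output, the matrix is symmetric, the diagonal entries sit near 1/12 ≈ 0.083 (the variance of a Uniform (0,1) variable), and the off-diagonal entries are close to zero because the columns are generated independently.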
Intuitive: Mahalanobis Distance
Distribution of Mahalanobis
Distance
• It turns out that if an N x d data set A is
from a multivariate Normal distribution,
then the squared Mahalanobis distance follows a
Chi-Square distribution with d degrees of
freedom.
Chi-Square Distribution
Curse of dimensionality
Algorithm for Finding Outliers
>> chi2inv(.975,d)
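The slide's algorithm survives here only as the `chi2inv(.975,d)` call, so the following Python sketch is an assumption about its shape: estimate the mean and covariance, compute each point's squared Mahalanobis distance, and flag points whose distance exceeds the 0.975 quantile of the Chi-Square distribution with d degrees of freedom. To stay within the standard library it is restricted to d = 2, where that quantile has the closed form -2 ln(1 - p). The data and the planted point (8, -8) are made up.

```python
import math
import random

def chi2_quantile_2df(p):
    """Inverse CDF of the Chi-Square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)

def mean_and_cov(points):
    """Sample mean and the entries (sxx, sxy, syy) of the 2 x 2 covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(p, mu, cov):
    """(p - mu) Sigma^{-1} (p - mu)' using the explicit 2 x 2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

random.seed(1)  # arbitrary seed so the run is repeatable
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
points.append((8.0, -8.0))  # planted outlier, for illustration only

mu, cov = mean_and_cov(points)
threshold = chi2_quantile_2df(0.975)  # same role as chi2inv(.975, 2)
flagged = [p for p in points if mahalanobis_sq(p, mu, cov) > threshold]
print(round(threshold, 4), flagged)
```

For general d the quantile has no such simple form and a routine such as `chi2inv` is needed. Even on clean Normal data, roughly 2.5% of points are expected to exceed this threshold.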
Homework
• Define the first, second, and third quartiles in terms
of the cumulative distribution function.
• Use that to understand the previous
algorithm.
• Start looking up Matlab help files in the
Statistics toolbox.
• Also, figure out what “estimating the parameters
of a distribution from data” means.