Random Variables and Distributions
COMP5318 Knowledge Discovery and Data Mining
Examples
• We have heard statements like “Height is Normally distributed”.
[Figure: Normal density curve with the mean and standard deviation marked]
Why distributions are important
• Distributions capture the essence of the data associated with one or more variables (e.g., height).
• If we know that height is Normally distributed, then a small random sample is enough to give a very good idea of the general population.
• We can then answer questions like: what is the probability of finding a 2-meter-tall Australian?
• To do this we need to understand the concept of a random variable.
Random Variable
• Let S be the sample space.
• A random variable X is a function X: S → ℝ that assigns a real number to each outcome in S.
• Example: suppose we toss a coin twice and let X be the number of heads.
Random Variable
(Number of Heads in two coin tosses)
S     X
TT    0
TH    1
HT    1
HH    2
We also associate a probability with X attaining that value.
Random Variable
(Number of Heads in two coin tosses)
S     Prob   X
TT    1/4    0
TH    1/4    1
HT    1/4    1
HH    1/4    2

X     P(X=x)
0     1/4
1     1/2
2     1/4
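The two tables above can be rebuilt mechanically. The following is a minimal Python sketch, assuming a fair coin; the names `sample_space`, `num_heads` and `pmf` are illustrative and do not come from the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: TT, TH, HT, HH.
sample_space = ["".join(t) for t in product("TH", repeat=2)]

def num_heads(outcome):
    """The random variable X: maps an outcome such as 'TH' to its number of heads."""
    return outcome.count("H")

# Every outcome has probability 1/4; add up the outcomes that share a value of X.
p_outcome = Fraction(1, len(sample_space))
pmf = Counter()
for outcome in sample_space:
    pmf[num_heads(outcome)] += p_outcome

for x in sorted(pmf):
    print(x, pmf[x])
```

The random variable is just an ordinary function on outcomes; the probabilities P(X=x) come from grouping outcomes by the value that function returns.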
Random Variables follow a Distribution
• The height of Australian soldiers is a random variable which follows a Normal distribution with mean 180 cm and standard deviation 15 cm.
• The frequency of words in a text is a random variable which follows a Zipf distribution.
• The speed of a hurricane is a random variable which follows a Cauchy distribution.
• The number of car accidents in a fixed time duration is a random variable which follows a Poisson distribution.
• The number of heads in a sequence of coin tosses is a random variable which follows a Binomial distribution.
• The number of web hits in a given time period is a random variable (r.v.) which follows a Pareto distribution.
• Many times we don’t know what named distribution a r.v. follows, or whether it follows any named distribution at all!
Distribution Definitions
• Discrete Probability Distribution
• Continuous Probability Distribution
• Cumulative Distribution Function
Discrete Distribution
• A r.v. X is discrete if it takes countably many values {x_1, x_2, ...}.
• The probability function or probability mass function for X is given by
– f_X(x) = P(X = x)
• From the previous example:
$$f_X(x) = \begin{cases} 1/4 & x = 0 \\ 1/2 & x = 1 \\ 1/4 & x = 2 \\ 0 & \text{otherwise} \end{cases}$$
Continuous Distributions
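• A r.v. X is continuous if there exists a function f_X such that f_X(x) ≥ 0 for all x,
$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$$
and, for every a ≤ b,
$$P(a < X < b) = \int_a^b f_X(x)\,dx$$
• The function f_X is called the probability density function (pdf).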
Example: Continuous Distribution
• Suppose X has the pdf
$$f_X(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
• This is the Uniform(0,1) distribution.
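As a quick check of the definition, the probability of an interval under the Uniform(0,1) pdf is simply the interval's length. The endpoints 0.2 and 0.5 below are chosen only for illustration and are not from the slides.

```latex
P(0.2 < X < 0.5) = \int_{0.2}^{0.5} f_X(x)\,dx = \int_{0.2}^{0.5} 1\,dx = 0.5 - 0.2 = 0.3
```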
Binomial Distribution
• A coin flips Heads with probability p. Flip it n
times and let X be the number of Heads.
Assume flips are independent.
• Let f(x) = P(X = x), then
$$f(x) = \begin{cases} \binom{n}{x} p^x (1 - p)^{n - x} & x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$
Binomial Example
• Let p = 0.5 and n = 5, then
$$P(X = 4) = \binom{5}{4}\, 0.5^4\, (1 - 0.5)^{5 - 4} = 5 \times 0.0625 \times 0.5 = 0.15625$$
• In Matlab
>> binopdf(4,5,0.5)
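For readers without the Statistics toolbox, the same number can be reproduced from the formula directly. This is a small Python sketch using only the standard library; `binomial_pmf` is a name chosen here and is only meant to mirror the role of `binopdf`, not its full interface.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p); zero outside x = 0, 1, ..., n."""
    if not (isinstance(x, int) and 0 <= x <= n):
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)

print(binomial_pmf(4, 5, 0.5))  # 0.15625
```

Summing `binomial_pmf(x, 5, 0.5)` over x = 0, ..., 5 gives 1, as any probability mass function must.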
Normal Distribution
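• X has a Normal (Gaussian) distribution with parameters μ and σ if
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}$$
• X is standard Normal if μ = 0 and σ = 1. A standard Normal r.v. is denoted Z.
• If X ~ N(μ, σ²) then (X − μ)/σ ~ N(0, 1), i.e. it has the same distribution as Z.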
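Standardizing lets any Normal probability be read off the standard Normal. As a worked illustration, take the soldier heights quoted earlier (mean 180 cm, standard deviation 15 cm); the height of 210 cm is an assumed value picked so that the arithmetic is clean.

```latex
X \sim N(180,\, 15^2), \qquad
z = \frac{210 - 180}{15} = 2, \qquad
P(X \le 210) = P(Z \le 2) \approx 0.9772
```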
Normal Example
• The number of spam emails received by an email server in a day follows a Normal distribution N(1000, 500²), i.e. mean 1000 and standard deviation 500. What is the probability of receiving 2000 spam emails in a day?
• Let X be the number of spam emails received in a day. We want P(X = 2000).
• The answer is P(X = 2000) = 0, because a continuous r.v. takes any single value with probability zero.
• It is more meaningful to ask for P(X ≥ 2000).
Normal Example
• This is
$$P(X \ge 2000) = 1 - P(X < 2000) = 1 - \int_{-\infty}^{2000} f(x)\,dx$$
• In Matlab: >> 1 - normcdf(2000,1000,500)
• The answer is 1 - 0.9772 = 0.0228, or 2.28%.
• This type of analysis is so common that there is a special name for it: the cumulative distribution function F.
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
Outliers
• In data mining we are often interested in outliers
– especially in high-dimensional data, which we cannot easily visualize.
• Knowledge of distributions can be very useful in this context.
• Let's see how.
Outliers in Normal Distribution
• Conventionally, a point is considered an outlier if it is at least three standard deviations away from the mean.
• Let's assume we have a standard Normal distribution, N(0,1).
• We want P(X < -3) + P(X > 3)
• = normcdf(-3,0,1) + 1 - normcdf(3,0,1) = 0.0027
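Both Normal tail computations in these slides (the spam example and the three-standard-deviation rule) can be reproduced outside Matlab. The sketch below uses Python's standard-library `statistics.NormalDist`, whose `cdf` method plays the role of `normcdf`; the helper names `upper_tail` and `two_sided_tail` are illustrative only.

```python
from statistics import NormalDist

def upper_tail(x, mean, sd):
    """P(X >= x) for X ~ Normal(mean, sd), computed as 1 - F(x)."""
    return 1.0 - NormalDist(mean, sd).cdf(x)

def two_sided_tail(k):
    """P(Z < -k) + P(Z > k) for a standard Normal Z."""
    z = NormalDist(0.0, 1.0)
    return z.cdf(-k) + (1.0 - z.cdf(k))

print(round(upper_tail(2000, 1000, 500), 4))  # 0.0228
print(round(two_sided_tail(3), 4))            # 0.0027
```

Note that `NormalDist` takes the standard deviation (500), not the variance, matching the argument order of `normcdf(x, mu, sigma)`.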
Outliers using Univariate Normal Distribution
• Typically we are given data and we want to find outliers in the data, if any.
• Here are the steps:
1. Assume that the data come from a Normal distribution.
2. Estimate the parameters of the Normal distribution.
3. Find all data points which are more than three standard deviations away from the mean.
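The three steps above can be sketched in a few lines of Python. This is one possible reading, assuming the parameters are estimated by the sample mean and sample standard deviation; the function name and the toy data are made up for illustration.

```python
from statistics import mean, stdev

def three_sigma_outliers(values, k=3.0):
    """Return the values more than k estimated standard deviations from the estimated mean."""
    mu = mean(values)      # step 2: estimate the mean
    sigma = stdev(values)  # step 2: estimate the standard deviation (sample version)
    return [v for v in values if abs(v - mu) > k * sigma]  # step 3

# Toy data: 19 readings near 10 and one suspicious reading of 50.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9,
        10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 50.0]
print(three_sigma_outliers(data))  # [50.0]
```

One caveat of this simple approach: the outlier itself inflates the estimated mean and standard deviation, so in small samples an extreme point can partly mask itself.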
Outliers in Multidimensional Data
• Recall that in the Iris data we have four attributes and one class label.
• This is an example of a multidimensional data set.
• Look at the exponent of the Normal distribution. Apart from the factor -1/2, it is
$$\left( \frac{x - \mu}{\sigma} \right)^2 = (x - \mu)\,(\sigma^2)^{-1}\,(x - \mu)$$
• This is the square of the distance from a point x to the mean μ in units of standard deviation σ.
Outliers in Multidimensional Data
• In multidimensional data this can be generalized to:
$$(x - \mu)\,\Sigma^{-1}\,(x - \mu)'$$
• This is called the Mahalanobis Distance (squared). Here x and μ are 1 x d row vectors.
• Σ is a d x d matrix called the variance-covariance matrix.
Variance-Covariance Matrix
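If the data set is an N x d matrix, then Σ is the d x d matrix
$$\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1d} \\ \sigma_{21} & \cdots & \sigma_{2d} \\ \vdots & & \vdots \\ \sigma_{d1} & \cdots & \sigma_{dd} \end{pmatrix}$$
where σ_ij is the covariance between attributes i and j, so σ_ii is the variance of attribute i and Σ is symmetric.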
In Matlab
• Suppose we generate random 100 x 5 data: >> data = rand(100,5);
• The covariance matrix is >> cv = cov(data)
 0.0998  -0.0022   0.0006  -0.0080  -0.0025
-0.0022   0.0933  -0.0051  -0.0100  -0.0010
 0.0006  -0.0051   0.0810  -0.0085   0.0083
-0.0080  -0.0100  -0.0085   0.0820   0.0071
-0.0025  -0.0010   0.0083   0.0071   0.0859
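To see what `cov` computes, here is a hedged Python sketch that builds the sample variance-covariance matrix entry by entry, using the N - 1 divisor that Matlab's `cov` uses by default. The function name `covariance_matrix` and the seed are assumptions made for this example; the numbers it prints will differ from the Matlab output above because the random data differ.

```python
import random
from statistics import mean

def covariance_matrix(rows):
    """Sample variance-covariance matrix (divisor N - 1) of an N x d data set given as a list of rows."""
    n, d = len(rows), len(rows[0])
    means = [mean(row[j] for row in rows) for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            s = sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows)
            cov[i][j] = s / (n - 1)
    return cov

random.seed(0)  # arbitrary seed, only so that repeated runs agree
data = [[random.random() for _ in range(5)] for _ in range(100)]
for row in covariance_matrix(data):
    print(" ".join(f"{v:8.4f}" for v in row))
```

For Uniform(0,1) columns the diagonal entries should be near 1/12 ≈ 0.083 and the off-diagonal entries near zero, which is the pattern visible in the Matlab output.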
Intuitive: Mahalanobis Distance
Distribution of Mahalanobis Distance
• It turns out that if an N x d data set A is from a multivariate Normal distribution, then the squared Mahalanobis distance follows a Chi-Square distribution with d degrees of freedom.
Chi-Square Distribution
Curse of dimensionality
Algorithm for Finding Outliers
>>chi2inv(.975,d)
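The body of the algorithm slide did not survive in this transcript; only the `chi2inv(.975,d)` call remains. One plausible reading, offered purely as a sketch, is: estimate the mean and covariance, compute each point's squared Mahalanobis distance, and flag points whose distance exceeds the 0.975 quantile of the Chi-Square distribution with d degrees of freedom. The Python below restricts itself to d = 2, where the 2 x 2 inverse and the Chi-Square quantile have closed forms, so no toolbox is needed; all function names are hypothetical.

```python
import math
import random
from statistics import mean

def chi2_quantile_2df(p):
    """Chi-Square quantile for 2 degrees of freedom (its CDF is 1 - exp(-t/2))."""
    return -2.0 * math.log(1.0 - p)

def mahalanobis_outliers_2d(points, p=0.975):
    """Flag 2-D points whose squared Mahalanobis distance exceeds the Chi-Square p-quantile."""
    n = len(points)
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    # Entries of the 2 x 2 sample covariance matrix.
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    threshold = chi2_quantile_2df(p)
    flagged = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) * inverse(Sigma) * (dx, dy)' written out for the 2 x 2 case.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > threshold:
            flagged.append((x, y))
    return flagged

random.seed(1)  # arbitrary seed for repeatability
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
points.append((8.0, -8.0))  # a planted outlier
print(mahalanobis_outliers_2d(points))
```

With a 0.975 cutoff, roughly 2.5% of genuinely Normal points are expected to be flagged as well, so the output typically contains a few ordinary points alongside the planted one. For general d the threshold would come from `chi2inv(.975,d)` and the inverse from a linear-algebra routine.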
Homework
• Define the first, second and third quartiles in terms of the cumulative distribution function.
• Use that to understand the previous algorithm.
• Start looking up the Matlab help files for the Statistics toolbox.
• Also, figure out the meaning of “estimating the parameters of a distribution from data”.