Download Random Variables and Distributions

Random Variables and Distributions COMP5318 Knowledge Discovery and Data Mining Examples Examples • We have heard of statements like “Height is Normally Distributed” Standard deviation mean Why distributions are important • Distribution capture the essence of data associated with a particular variable(s) (e.g., height). • If we know height is Normally distributed then a small random sample is enough to provide a very good idea about the general population. • Can answer questions like: what is the probability of finding a 2 meter tall Australian? • Need to understand the concept of random variable. Random Variable • Let S be the sample space. • A random variable X is a function X: SReal Suppose we toss a coin twice. Let X be the random variable number of heads Random Variable (Number of Heads in two coin tosses) S X TT 0 TH 1 HT 1 HH 2 We also associate a probability with X attaining that value. Random Variable (Number of Heads in two coin tosses) S Prob X TT 1/4 0 X P(X=x) TH 1/4 1 0 1/4 HT 1/4 1 1 1/2 HH 1/4 2 2 1/4 Random Variables follow a Distribution • The height of Australian soldiers is a random variable which follows a Normal distribution with mean 180 cm and standard deviation 15 cm. • The frequency of words in a text is a random variable which follows a Zipf distribution. • The speed of a hurricane is a random variable which follows a Cauchy distribution. • The number of car accidents in a fixed time duration is a random variable which follows a Poisson distribution. • The number of heads in a sequence of coin tosses is a random variable which follows a Binomial distribution. • The number of web hits in a given time period is a r.v. which follows a Pareto distribution. • Many times we don’t know what named distribution a r.v. follows or whether it follows any named distribution at all! Distribution Definitions • Discrete Probability Distribution • Continuous Probability Distribution • Cumulative Distribution Function Discrete Distribution • A r.v. X is discrete if it takes countably many values {x1,x2,….} • The probability function or probability mass function for X is given by – fX(x)= P(X=x) • From previous example 1 / 4 1 / 2  f X ( x)   1 / 4   0 x0 x 1 x2 otherwise Continuous Distributions • A r.v. X is continuous if there exists a function fX such that  f fX  0 X ( x)dx  1  b P(a  x  b)   f X ( x)dx a Example: Continuous Distribution • Suppose X has the pdf 1 0  x  1 f X ( x)   0 otherwise • This is the Uniform (0,1) distribution Binomial Distribution • A coin flips Heads with probability p. Flip it n times and let X be the number of Heads. Assume flips are independent. • Let f(x) =P(X=x), then  n  x   p (1  p) n  x f ( x)   x   0  x  0,1,...n otherwise Binomial Example • Let p =0.5; n = 5 then  5 4 P( X  4)   0.5 (1  0.5)54  0.1562  4 • In Matlab >>binopdf(4,5,0.5) Normal Distribution • X has a Normal (Gaussian) distribution with parameters μ and σ if f ( x)  1  1  exp  2 ( x   ) 2   2  2  • X is standard Normal if μ =0 and σ =1. It is denoted as Z. • If X ~ N(μ, σ 2) then X   ~Z Normal Example • The number of spam emails received by a email server in a day follows a Normal Distribution N(1000,500). What is the probability of receiving 2000 spam emails in a day? • Let X be the number of spam emails received in a day. We want P(X = 2000)? • The answer is P(X=2000) = 0; • It is more meaningful to ask P(X >= 2000); Normal Example • This is 2000 P( X  2000)  1  P( X  2000)  1   f ( x)dx  • In Matlab: >> 1 –normcdf(2000,1000,500) • The answer is 1 – 0.9772 = 0.0228 or 2.28% • This type of analysis is so common that there is a special name for it: cumulative distribution function F. x F ( x)  P( X  x)   f ( x)dx  Outliers • In data mining we are often interested in outliers – especially in high dimensional data which we cannot easily visualize • A knowledge of distributions can be very useful in this context. • Lets see how? Outliers in Normal Distribution • Conventionally something is considered an outlier if it is at least three standard deviations away from the mean: • Lets assume we have a standard Normal Distribution: N(0,1) • We want P(X < -3) + P(X >3) • = normcdf(-3,0,1) + 1 – normcdf(3,0,1)=0.0027 Outliers using Univariate Normal Distribution • Typically we are given data and we want to find outliers in the data –if any. • Here are the steps: 1. Make the assumption that the data come from a Normal distribution. 2. Estimate the parameters of the Normal distribution. 3. Find all data points which are more than three standard deviations away from the mean. Outliers in Multidimensional Data • Recall, in the Iris data, we have four attributes and one class label. • This is an example of multidimensional data set. • Look at the exponent of the Normal distribution.  x  2    ( x   ) ( x   )    2 • This is the square of the distance from a point x to the mean μ in units of standard deviation σ Outliers in Multidimensional Data • In multidimensional data this can be generalized to: ( x   ) ( x   )' 1 • This is called the Mahalanobis Distance (squared) • Σ is d x d matrix called the variance-covariance matrix Variance-Covariance Matrix If the Data set is an N x d matrix then   11 ..  1d        12 ..  2 d     .. dd   d1 In Matlab • Suppose we generate a random 100x5 data >> data = rand(100,5); • The covariance matrix is >>cv =cov(data) 0.0998 -0.0022 0.0006 -0.0080 -0.0025 -0.0022 0.0933 -0.0051 -0.0100 -0.0010 0.0006 -0.0080 -0.0025 -0.0051 -0.0100 -0.0010 0.0810 -0.0085 0.0083 -0.0085 0.0820 0.0071 0.0083 0.0071 0.0859 Intuitive: Mahalanobis Distance Distribution of Mahalanobis Distance • It turns out that if an N x d data set A if from a multivariate Normal Distribution then the Mahalanobis distance follows a a Chi-Square distribution with d degrees of freedom. Chi-Square Distribution Curse of dimensionality Algorithm for Finding Outliers >>chi2inv(.975,d) Homework • Define first, second, third quantile in terms of cumulative distribution function? • Use that to understand the previous algorithm. • Start looking up Matlab help files in the Statistics toolbox. • Also, figure out what is the meaning of “estimating the parameter of a distribution from data”.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Random Variables and Distributions