Data analysis and uncertainty
Outline
• Random Variables
• Estimation
• Sampling
Introduction
• Reasons for Uncertainty
  – Prediction
    • Making a prediction about tomorrow based on data we have today
  – Sampling
    • The data may be a sample from a population, and we do not know how our sample differs from other samples (or from the population)
  – Missing or unknown values
    • We need to guess these values
    • Example: censored data
Introduction
• Dealing with Uncertainty
  – Probability
  – Fuzzy logic
• Probability Theory vs. Probability Calculus
  – Probability Theory
    • The mapping from the real world to a mathematical representation
  – Probability Calculus
    • Based on well-defined and generally accepted axioms
    • The aim is to explore the consequences of those axioms
Introduction
• Frequentist (probability is objective)
  – The probability of an event is defined as the limiting proportion of times that the event would occur in identical situations
  – Examples
    • The proportion of times a head comes up when the same coin is tossed repeatedly
    • Assessing the probability that a supermarket customer will buy a certain item (using similar customers)
Introduction
• Bayesian (subjective probability)
  – Explicit characterization of all uncertainty, including any parameters estimated from the data
  – Probability is an individual's degree of belief that a given event will occur
• Frequentist vs. Bayesian
  – Toss a coin 10 times and get 7 heads
  – The frequentist estimate is P(head) = 7/10
  – A Bayesian starts from a prior belief, e.g. P(head) = 0.5, and combines that prior with the data to estimate the probability (see the sketch below)
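A minimal sketch of the contrast, assuming a symmetric Beta(5,5) prior (an illustrative choice; the slides only say the prior belief is 0.5):

```python
# Frequentist vs. Bayesian estimates for 7 heads in 10 tosses.
# The Beta(5,5) prior (centered on 0.5) is an illustrative assumption.
heads, n = 7, 10
alpha, beta = 5, 5                                 # prior pseudo-counts

freq_est = heads / n                               # MLE: 0.7
bayes_est = (heads + alpha) / (n + alpha + beta)   # posterior mean: 12/20 = 0.6

print(f"frequentist estimate:    {freq_est:.2f}")
print(f"Bayesian posterior mean: {bayes_est:.2f}")
```

The Bayesian answer is pulled from 0.7 toward the prior belief of 0.5; with more data the two estimates converge.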
Random variable
• A mapping from a property of objects to a variable that can take a set of possible values, via a process that appears to the observer to have an element of unpredictability
• Examples
  – Coin toss (the domain is the set {heads, tails})
  – Number of times a coin has to be tossed to get a head (the domain is the positive integers)
  – A student's score (the domain is the set of integers from 0 to 100)
Properties of single random variable
• X is random variable and x is its value
• Domain is finite:
– probability mass function p(x)
• Domain is real line:
– probability density function f(x)
• Expectation of X
  – Discrete case: E[X] = \sum_i x_i \, p(x_i)
  – Continuous case: E[X] = \int x \, f(x) \, dx
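A quick numeric illustration of the discrete formula (a fair six-sided die; not an example from the slides):

```python
# Expectation of a discrete random variable: E[X] = sum_i x_i * p(x_i).
values = [1, 2, 3, 4, 5, 6]     # a fair six-sided die
probs = [1 / 6] * 6

expectation = sum(x * p for x, p in zip(values, probs))
print(expectation)              # 3.5
```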
Multivariate random variable
• Set of several random variables
• For a p-dimensional vector x = {x_1, ..., x_p}
• The joint mass function is
  p(X_1 = x_1, \ldots, X_p = x_p) = p(x_1, \ldots, x_p)
The joint mass function
• For example
– Roll two fair dice; X is the first die's result and Y is the second's
– Then p(X = 3, Y = 3) = 1/6 × 1/6 = 1/36
The joint mass function
          X=1    X=2    X=3    X=4    X=5    X=6   | P_Y(y)
  Y=1     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=2     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=3     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=4     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=5     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  Y=6     1/36   1/36   1/36   1/36   1/36   1/36  |  0.17
  P_X(x)  0.17   0.17   0.17   0.17   0.17   0.17  |  1.00

Each marginal entry is 6/36 ≈ 0.17, and the joint probabilities sum to 1.00.
Marginal probability mass function
• The marginal probability mass functions of X and Y are
  f_X(x) = p(X = x) = \sum_y p_{XY}(x, y)
  f_Y(y) = p(Y = y) = \sum_x p_{XY}(x, y)
Continuous
• The marginal probability density functions of X and Y are
  f_X(x) = \int f_{XY}(x, y) \, dy
  f_Y(y) = \int f_{XY}(x, y) \, dx
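In the discrete case the marginals are just row and column sums of the joint table. A minimal sketch using the two-dice table above:

```python
import numpy as np

# Joint pmf of two fair dice: a 6x6 table with every entry 1/36.
joint = np.full((6, 6), 1 / 36)

# Marginals: sum out the other variable (rows are Y, columns are X).
p_x = joint.sum(axis=0)      # sum over y for each x
p_y = joint.sum(axis=1)      # sum over x for each y

print(p_x)                   # each entry 1/6, the 0.17 in the table
print(p_x.sum(), p_y.sum())  # both 1.0
```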
Conditional probability
• The density of a single variable (or a subset of the complete set of variables) given (or "conditioned on") particular values of the other variables
• The conditional density of X given some value of Y is denoted f(x|y) and defined as
  f(x \mid y) = \frac{f(x, y)}{f(y)}
Conditional probability
• For example
  – A student's score is chosen uniformly at random
  – The sample space is S = {0, 1, ..., 100}
  – What is the probability that the student fails (scores below 60)?
    P(F) = \frac{|F|}{|S|} = \frac{60}{101}
  – Given that the student's score is even (including 0), what is the probability that the student fails?
    |E| = 51, \quad |E \cap F| = 30
    P(F \mid E) = \frac{P(E \cap F)}{P(E)} = \frac{30/101}{51/101} = \frac{30}{51}
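A minimal sketch that checks these counts by enumeration (F and E as defined above):

```python
from fractions import Fraction

# Scores 0..100, all equally likely. F = fail (score < 60), E = even score.
scores = range(101)
F = [s for s in scores if s < 60]            # 60 scores
E = [s for s in scores if s % 2 == 0]        # 51 scores
EF = [s for s in F if s % 2 == 0]            # 30 scores

print(Fraction(len(F), len(scores)))         # P(F)   = 60/101
print(Fraction(len(EF), len(E)))             # P(F|E) = 30/51
```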
Supermarket data
Conditional independence
• A generic problem in data mining is finding relationships between variables
  – Is purchasing item A likely to be related to purchasing item B?
• Variables are independent if there is no relationship; otherwise they are dependent
• Independent if p(x, y) = p(x) p(y), in which case
  p(A \mid B) = \frac{p(A \cap B)}{p(B)} = \frac{p(A)\,p(B)}{p(B)} = p(A)
  p(B \mid A) = \frac{p(A \cap B)}{p(A)} = \frac{p(A)\,p(B)}{p(A)} = p(B)
Conditional Independence: More than 2 variables
• X is conditionally independent of Y given Z if, for all values of X, Y, Z, we have
  p(x, y \mid z) = p(x \mid z)\, p(y \mid z)
Conditional Independence: More than 2 variables
• Example (continuing the student-score example)
  – P(F) = 60/101 and P(E) = 51/101, but P(E ∩ F) = 30/101 ≠ P(E) P(F)
  – So E and F are dependent
  – Let B be the event that the score is not 100; then
    • P(F | B) = 60/100
    • P(E | B) = 50/100 = 1/2
    • P(E ∩ F | B) = 30/100 = 60/100 × 1/2
    • Given B, E and F are conditionally independent
Conditional Independence: More than 2 variables
• Example (continued)
  – Let C be the event that the score is 100; then
    • P(F | C) = 0
    • P(E | C) = 1
    • P(E ∩ F | C) = 0 = 1 × 0
    • Given C, E and F are conditionally independent
  – Now we can recover P(E ∩ F) by conditioning on the score (a numeric check appears in the sketch below):
    p(E \cap F) = p(E \cap F \mid C)\, p(C) + p(E \cap F \mid B)\, p(B)
                = 0 \times \frac{1}{101} + \frac{1}{2} \times \frac{60}{100} \times \frac{100}{101}
                = \frac{30}{101}
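A minimal enumeration that verifies the conditional independence claimed above:

```python
from fractions import Fraction

scores = range(101)
fail = lambda s: s < 60
even = lambda s: s % 2 == 0
B = lambda s: s != 100                    # condition: score is not 100

def cond_prob(event, given):
    """P(event | given) under a uniform distribution on 0..100."""
    g = [s for s in scores if given(s)]
    return Fraction(sum(1 for s in g if event(s)), len(g))

p_f = cond_prob(fail, B)                                   # 60/100
p_e = cond_prob(even, B)                                   # 1/2
p_ef = cond_prob(lambda s: fail(s) and even(s), B)         # 30/100

print(p_ef == p_f * p_e)                  # True: F and E independent given B
```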
Conditional Independence
• Conditional independence does not imply marginal independence:
  p(x, y \mid z) = p(x \mid z)\, p(y \mid z)  does not imply  p(x, y) = p(x)\, p(y)
• Note that X and Y may be unconditionally independent but conditionally dependent given Z
On assuming independence
• Independence is a strong assumption that is frequently violated in practice
• But it simplifies modeling
  – Fewer parameters
  – More understandable models
Dependence and Correlation
• Covariance measures how X and Y vary together
  – Large and positive if large X is associated with large Y, and small X with small Y
  – Negative if large X is associated with small Y
• Two variables may be dependent but not linearly correlated (see the sketch below)
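A standard illustration of the last point (not from the slides): Y = X² is fully determined by X, yet for a symmetric X the linear correlation is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)     # symmetric around 0
y = x ** 2                       # completely dependent on x

print(np.corrcoef(x, y)[0, 1])   # close to 0: no *linear* correlation
```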
Correlation and Causation
• Two variables may be highly correlated without a causal relationship between them
  – Yellow-stained fingers and lung cancer may be correlated, but they are causally linked only through a third variable: smoking
  – Human reaction time and earned income are negatively correlated
    • This does not mean one causes the other
    • A third variable, age, is causally related to both
Samples and Statistical inference
• Samples can be used to model the data
• If the goal is to detect small deviations in the data, the sample size will affect the result
Dual Role of Probability and Statistics in Data Analysis
Outline
• Random Variables
• Estimation
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Sampling
Estimation
• In inference we want to make statements about the entire population from which the sample was drawn
• Two important methods for estimating the parameters of a model:
  – Maximum Likelihood Estimation
  – Bayesian Estimation
Desirable properties of estimators
• Let \hat{\theta} be an estimate of the parameter \theta
• Two measures of the quality of the estimator \hat{\theta}:
  – Bias: the difference between the expected and true values
    Bias(\hat{\theta}) = E[\hat{\theta}] - \theta
  – Variance of the estimate
    Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]
Mean squared error
• The mean of the squared difference between the value of the estimator and the true value of the parameter:
  E[(\hat{\theta} - \theta)^2]
• The mean squared error can be partitioned as the sum of the squared bias and the variance:
  E[(\hat{\theta} - \theta)^2] = Bias^2(\hat{\theta}) + Var(\hat{\theta})
Mean squared error
E[(\hat{\theta} - \theta)^2]
  = E[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2]
  = E[(\hat{\theta} - E[\hat{\theta}])^2] + (E[\hat{\theta}] - \theta)^2 + 2\, E[(\hat{\theta} - E[\hat{\theta}])(E[\hat{\theta}] - \theta)]
  = Var(\hat{\theta}) + Bias^2(\hat{\theta}) + 2\,(E[\hat{\theta}] - \theta)\, E[\hat{\theta} - E[\hat{\theta}]]
  = Var(\hat{\theta}) + Bias^2(\hat{\theta})
since E[\hat{\theta} - E[\hat{\theta}]] = E[\hat{\theta}] - E[\hat{\theta}] = 0.
Identities used:
  (a - b + b - c)^2 = (a - b)^2 + (b - c)^2 + 2(a - b)(b - c)
  E[E[X]] = E[X]
  E[c] = c where c is a constant
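A minimal simulation of the decomposition, using a deliberately biased estimator of a normal mean (the shrinkage factor 0.9 and the true mean 2.0 are illustrative assumptions):

```python
import numpy as np

# Check MSE = Bias^2 + Var for the estimator est = 0.9 * (sample mean).
rng = np.random.default_rng(1)
theta, n, trials = 2.0, 20, 100_000

samples = rng.normal(theta, 1.0, size=(trials, n))
est = 0.9 * samples.mean(axis=1)         # biased estimator

mse = np.mean((est - theta) ** 2)
bias = est.mean() - theta
var = est.var()

print(mse, bias ** 2 + var)              # agree up to simulation noise
```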
Maximum Likelihood Estimation
• The most widely used method for parameter estimation
• The likelihood function is the probability that the data D would have arisen for a given value of \theta:
  L(\theta \mid D) = L(\theta \mid x(1), \ldots, x(n)) = p(x(1), \ldots, x(n) \mid \theta) = \prod_{i=1}^{n} p(x(i) \mid \theta)
• The value of \theta for which the data has the highest probability is the MLE
Example of MLE for Binomial
• Customers either purchase or do not purchase milk
  – We want an estimate of the proportion purchasing
• Binomial with unknown parameter \theta
• Samples x(1), ..., x(1000), of which r purchase milk
• Assuming conditional independence, the likelihood function is
  L(\theta \mid x(1), \ldots, x(1000)) = \prod_i \theta^{x(i)} (1 - \theta)^{1 - x(i)} = \theta^{r} (1 - \theta)^{1000 - r}
Log-likelihood Function
• Maximizing the log-likelihood is easier and gives the same maximizer:
  l(\theta) = \log L(\theta) = r \log\theta + (1000 - r) \log(1 - \theta)
• Differentiate and set equal to zero:
  \frac{dl(\theta)}{d\theta} = \frac{r}{\theta} - \frac{1000 - r}{1 - \theta} = 0
  \Rightarrow r(1 - \theta) = (1000 - r)\,\theta
  \Rightarrow r - r\theta = 1000\theta - r\theta
  \Rightarrow \hat{\theta} = \frac{r}{1000}
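A minimal numeric check that maximizing the log-likelihood reproduces the closed form r/n (the scipy optimizer is one convenient choice, not the slides' method):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Maximize l(theta) = r*log(theta) + (n - r)*log(1 - theta).
r, n = 700, 1000
neg_log_lik = lambda t: -(r * np.log(t) + (n - r) * np.log(1 - t))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, r / n)    # both ~0.7
```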
Example of MLE for Binomial
• r milk purchases out of n customers
• \theta is the probability that milk is purchased by a random customer
• For three data sets:
  – r = 7, n = 10
  – r = 70, n = 100
  – r = 700, n = 1000
• The uncertainty becomes smaller as n increases
Example of MLE for Binomial
(figure: likelihood functions for the three data sets above)
Likelihood under Normal Distribution
• Variance known and equal to 1, mean \theta unknown
• Normal density:
  f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-(x - \theta)^2 / 2}
• Likelihood function:
  L(\theta \mid x(1), \ldots, x(n)) = \prod_{i=1}^{n} (2\pi)^{-1/2} \exp\!\left(-\tfrac{1}{2}(x(i) - \theta)^2\right)
                                    = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2}\sum_{i=1}^{n}(x(i) - \theta)^2\right)
Log-likelihood function
l(\theta \mid x(1), \ldots, x(n)) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n}(x(i) - \theta)^2
• To find the MLE, set the derivative dl/d\theta to zero:
  \frac{dl(\theta)}{d\theta} = \sum_{i=1}^{n}(x(i) - \theta) = 0
  \Rightarrow \sum_{i=1}^{n} x(i) - n\theta = 0
  \Rightarrow \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x(i)
Likelihood under Normal Distribution
• \theta is the estimated mean
• For two randomly generated data sets:
  – 20 data points
  – 200 data points
Likelihood under Normal Distribution
(figure: likelihood functions for the 20- and 200-point data sets)
Sufficient statistic
• A quantity s(D) is a sufficient statistic for \theta if the likelihood l(\theta) depends on the data only through s(D)
• No other statistic calculated from the same sample provides any additional information about the value of the parameter
Interval estimate
• A point estimate does not convey the uncertainty associated with it
• An interval estimate provides a confidence interval
Likelihood under Normal Distribution
Normal distribution:
  f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \theta)^2}{2\sigma^2}\right)
Likelihood function:
  L(\theta, \sigma^2 \mid x_1, \ldots, x_n) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \theta)^2\right)
Log-likelihood function:
  l(\theta, \sigma^2 \mid x_1, \ldots, x_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \theta)^2
Mean
\frac{\partial l}{\partial \theta} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \theta) = 0
\Rightarrow \sum_{i=1}^{n}(x_i - \theta) = 0
\Rightarrow \sum_{i=1}^{n} x_i - n\theta = 0
\Rightarrow \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i
Variance
\frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \theta)^2 = 0
\Rightarrow \frac{n}{2\sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \theta)^2
\Rightarrow \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\theta})^2
Outline
• Random Variables
• Estimation
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Sampling
Bayesian approach
• Frequentist approach
  – The parameters of the population are fixed but unknown
  – The data are a random sample
  – The intrinsic variability lies in the data D = {x(1), \ldots, x(n)}
• Bayesian approach
  – The data are known
  – The parameters \theta are random variables
  – \theta has a distribution of values
  – p(\theta) reflects our degree of belief about where the true parameters \theta may lie
Bayesian estimation
• The prior is updated by Bayes' rule:
  p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}
• This leads to a distribution rather than a single value
  – A single value can be obtained from the mean or the mode
Bayesian estimation
• p(D) is a constant independent of \theta
• For a given data set D and a particular model (model = distributions for the prior and the likelihood):
  p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
• If we have a weak belief about the parameter before collecting data, choose a wide prior (e.g. a normal with large variance)
Binomial example
• A single binary variable X; we wish to estimate \theta = p(X = 1)
• The prior for a parameter in [0, 1] is the Beta distribution:
  p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}
  where \alpha > 0 and \beta > 0 are the two parameters of this model:
  Beta(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}
Binomial example
• Likelihood function:
  L(\theta \mid D) = \theta^{r}(1 - \theta)^{n - r}
• Combining likelihood and prior:
  p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) \propto \theta^{r}(1 - \theta)^{n - r}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} = \theta^{r + \alpha - 1}(1 - \theta)^{n - r + \beta - 1}
• We get another Beta distribution
  – With parameters r + \alpha and n - r + \beta (see the sketch below)
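A minimal sketch of the conjugate update with illustrative numbers:

```python
# Beta-Binomial update: prior Beta(alpha, beta) plus r successes in n
# trials gives posterior Beta(r + alpha, n - r + beta).
alpha, beta = 5, 5                 # illustrative prior, as on the next slide
r, n = 7, 10

post_alpha = r + alpha             # 12
post_beta = n - r + beta           # 8
post_mean = post_alpha / (post_alpha + post_beta)

print(post_mean)                   # 0.6: between the MLE 0.7 and prior mean 0.5
```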
Beta(5,5) and Beta(145,145)
(figure panels: Beta(5,5) and Beta(45,50) densities)
Advantages of Bayesian approach
• Retains full knowledge of all problem uncertainty
• Calculates the full posterior distribution of \theta
• Natural updating of the distribution as new data arrive:
  p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(D_1 \mid \theta)\, p(\theta)
Predictive distribution
• In the equation that updates the prior to the posterior,
  p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}
• The denominator is called the predictive distribution of D
• Useful for model checking
  – If the observed data have only a small probability under it, the model is unlikely to be correct
Normal distribution example
• Suppose x comes from a normal distribution with unknown mean \theta and known variance \alpha
• The prior distribution for \theta is \theta \sim N(\theta_0, \alpha_0)
Normal distribution example
p ( | x)  p ( x |  ) p ( )
1
1
1
1
2

exp( 
(x  ) )
exp( 
(   0 ) 2 )
2
2 0
2
20
1 2 1 1
 x
x 2 0
 exp(   (  )   (  ) 

)
2
0 
 0  2 2 0
2
1
let 1  ( 0  1) 1 1  1 (
0 x
 )
0 
1 2
1
 exp(   / 1  1 / 1 )  exp(  (  1 ) 2 / 1 )
2
2
1
1
p ( | x) 
exp(  (  1 ) 2 / 1 )
2
21
Jeffreys prior
• A reference prior
• Fisher information:
  I(\theta \mid x) = -E\!\left[\frac{\partial^2 \log L(\theta \mid x)}{\partial \theta^2}\right]
• Jeffreys prior:
  p(\theta) \propto \sqrt{I(\theta \mid x)}
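A standard worked example (not on the slides): for a single Bernoulli observation with p(x \mid \theta) = \theta^{x}(1 - \theta)^{1 - x},

\frac{\partial^2 \log L}{\partial \theta^2} = -\frac{x}{\theta^2} - \frac{1 - x}{(1 - \theta)^2}

I(\theta \mid x) = -E\!\left[\frac{\partial^2 \log L}{\partial \theta^2}\right] = \frac{\theta}{\theta^2} + \frac{1 - \theta}{(1 - \theta)^2} = \frac{1}{\theta(1 - \theta)}

p(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}

which is the Beta(1/2, 1/2) distribution.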
Conjugate priors
• p(θ) is a conjugate prior for p(x| θ) if the
posterior distribution p(θ|x) is in the same
family as the prior p(θ)
– Beta prior → Beta posterior
– Normal prior → Normal posterior
Outline
• Random Variables
• Estimation
  – Maximum Likelihood Estimation
  – Bayesian Estimation
• Sampling
Sampling in Data Mining
• The data set was usually not collected with the statistical analysis in mind
  – "Experimental design" in statistics is concerned with optimal ways of collecting data
  – Data miners usually cannot control the data collection process
  – The data may be ideally suited to the purposes for which they were collected, but not adequate for their data mining uses
Sampling in Data Mining
• Two ways in which samples arise:
  – The database is a sample of the population
  – The database contains every case, but the analysis is based on a sample
    • Not appropriate when we want to find unusual records
Why sampling
• Draw a sample from the database that allows us to construct a model reflecting the structure of the data in the database
  – Efficiency: quicker and easier
• The sample must be representative of the entire database
Systematic sampling
• Tries to ensure representativeness
  – e.g. taking one out of every two records
• Can lead to problems when there are regularities in the database
  – e.g. a data set where the records are married couples: every second record may be the spouse
Random Sampling
• Avoids regularities
• EPSEM sampling (equal probability of selection method)
  – Each record has the same probability of being chosen
Variance of Mean of Random Sample
• If the variance of a population of size N is \sigma^2, the variance of the mean of a simple random sample of size n drawn without replacement is
  \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)
  – Usually N >> n, so the second term is small, and the variance decreases as the sample size increases
Example
• 2000 points, population mean = 0.0108
• Random samples of size n = 10, 100, and 1000, each repeated 200 times
Example
(figure: distributions of the sample mean for n = 10, 100, and 1000)
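A minimal sketch reproducing this experiment (the normal population is an assumption; the slides only give the population size and mean):

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(size=2000)          # stand-in for the 2000-point data

for n in (10, 100, 1000):
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(200)]
    # variance of the sample mean ~ (sigma^2 / n) * (1 - n / N)
    print(n, np.var(means))
```

The printed variances fall roughly as 1/n, with the extra (1 - n/N) shrinkage visible at n = 1000.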
Stratified Random Sampling
• Split the population into non-overlapping subpopulations, or strata
• Advantages
  – Enables making statements about each of the subpopulations separately
• For example, one of the credit card companies we work with categorizes transactions into 26 categories: supermarket, gas station, and so on
Mean of Stratified Sample
• The total size of the population is N
• The k-th stratum has N_k elements in it
• n_k of them are chosen for the sample from this stratum
• The sample mean within the k-th stratum is \bar{x}_k
• Estimate of the population mean (see the sketch below):
  \sum_k \frac{N_k\, \bar{x}_k}{N}
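A minimal sketch of the stratified estimator with illustrative numbers:

```python
import numpy as np

# Weight each stratum's sample mean by its population share N_k / N.
N_k = np.array([6000, 3000, 1000])      # stratum sizes (illustrative)
xbar_k = np.array([2.0, 5.0, 10.0])     # sample means within each stratum

estimate = (N_k * xbar_k).sum() / N_k.sum()
print(estimate)                         # (12000 + 15000 + 10000) / 10000 = 3.7
```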
Cluster Sampling
• Every cluster contains many elements
• A simple random sample over elements is not appropriate
• Select clusters, not elements