Data analysis and uncertainty

Outline
• Random variables
• Estimation
• Sampling

Introduction
• Reasons for uncertainty
  – Prediction
    • Making a prediction about tomorrow based on data we have today
  – Sampling
    • The data may be a sample from a population, and we don't know how our data differ from another sample (or from the population)
  – Missing or unknown values
    • We need to guess these values
    • Example: censored data

Introduction
• Dealing with uncertainty
  – Probability
  – Fuzzy approaches
• Probability theory vs. probability calculus
  – Probability theory
    • Mapping from the real world to a mathematical representation
  – Probability calculus
    • Based on well-defined and generally accepted axioms
    • The aim is to explore the consequences of those axioms

Introduction
• Frequentist (probability is objective)
  – The probability of an event is defined as the limiting proportion of times the event would occur in identical situations
  – Examples
    • The proportion of times a head comes up when tossing the same coin repeatedly
    • Assessing the probability that a supermarket customer will buy a certain item (using similar customers)

Introduction
• Bayesian (subjective probability)
  – Explicit characterization of all uncertainty, including any parameters estimated from the data
  – Probability is an individual's degree of belief that a given event will occur
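The frequentist definition above (probability as a limiting proportion in repeated trials) can be illustrated with a small simulation. A hypothetical sketch; the fair-coin probability 0.5 and the toss counts are assumptions of the example:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def estimated_p_heads(n_tosses):
    """Proportion of heads in n_tosses simulated tosses of a fair coin."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The proportion settles near the true probability 0.5 as n grows.
for n in (10, 1000, 100000):
    print(n, estimated_p_heads(n))
```

As the number of tosses grows, the estimated proportion approaches the underlying probability, which is exactly the frequentist reading of "probability".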
Frequentist vs. Bayesian
• Toss a coin 10 times and get 7 heads
• Frequentist: the estimated probability is P(A) = 7/10
• Bayesian: start from a prior guess, say P(A) = 0.5, then combine this prior belief with the data to estimate the probability

Random variable
• A mapping from a property of objects to a variable that can take a set of possible values, via a process that appears to the observer to have an element of unpredictability
• Examples
  – Coin toss (domain is the set {heads, tails})
  – Number of times a coin has to be tossed to get a head (domain is the positive integers)
  – A student's score (domain is the set of integers 0–100)

Properties of a single random variable
• X is a random variable and x is a value it can take
• If the domain is finite: probability mass function p(x)
• If the domain is the real line: probability density function f(x)
• Expectation of X
  – Discrete: E[X] = Σ_i x_i p(x_i)
  – Continuous: E[X] = ∫ x f(x) dx

Multivariate random variable
• A set of several random variables
• For a p-dimensional vector x = {x1, …, xp}, the joint mass function is
  p(X1 = x1, …, Xp = xp) = p(x1, …, xp)

The joint mass function
• Example: roll two fair dice; let X be the first die's result and Y the second's
• Then p(X = 3, Y = 3) = 1/6 × 1/6 = 1/36

The joint mass function

         X=1    X=2    X=3    X=4    X=5    X=6   Py(Y)
  Y=1   1/36   1/36   1/36   1/36   1/36   1/36   0.17
  Y=2   1/36   1/36   1/36   1/36   1/36   1/36   0.17
  Y=3   1/36   1/36   1/36   1/36   1/36   1/36   0.17
  Y=4   1/36   1/36   1/36   1/36   1/36   1/36   0.17
  Y=5   1/36   1/36   1/36   1/36   1/36   1/36   0.17
  Y=6   1/36   1/36   1/36   1/36   1/36   1/36   0.17
  Px(X) 0.17   0.17   0.17   0.17   0.17   0.17   1.00

Marginal probability mass function
• The marginal probability mass functions of X and Y are
  p_X(x) = p(X = x) = Σ_y p_XY(x, y)
  p_Y(y) = p(Y = y) = Σ_x p_XY(x, y)

Continuous case
• The marginal probability density functions of X and Y are
  f_X(x) = ∫ f_XY(x, y) dy
  f_Y(y) = ∫ f_XY(x, y) dx
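The two-dice joint mass function and its marginals can be checked with a short sketch. Exact fractions are used so the values match the table rather than decimal approximations; the code is purely illustrative:

```python
from fractions import Fraction

# Joint pmf of two fair dice: p(x, y) = 1/36 for every (x, y) pair.
joint = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

def marginal_x(x):
    """Marginal pmf of X: p_X(x) = sum over y of p_XY(x, y)."""
    return sum(p for (xi, yi), p in joint.items() if xi == x)

def marginal_y(y):
    """Marginal pmf of Y: p_Y(y) = sum over x of p_XY(x, y)."""
    return sum(p for (xi, yi), p in joint.items() if yi == y)

print(joint[(3, 3)])                            # 1/36
print(marginal_x(3))                            # 1/6 (the 0.17 in the table)
print(sum(marginal_y(y) for y in range(1, 7)))  # 1: the marginals sum to one
```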
Conditional probability
• The density of a single variable (or a subset of the complete set of variables) given (or "conditioned on") particular values of the other variables
• The conditional density of X given some value of Y is denoted f(x|y) and defined as
  f(x|y) = f(x, y) / f(y)

Conditional probability
• Example
  – A student's score is drawn at random; the sample space is S = {0, 1, …, 100}
  – What is the probability that the student fails (score below 60)?
    • P(F) = |F| / |S| = 60/101
  – Given that the student's score is even (including 0), what is the probability that the student fails?
    • |E| = 51, |E ∩ F| = 30
    • P(F|E) = P(E ∩ F) / P(E) = (30/101) / (51/101) = 30/51

Supermarket data

Conditional independence
• A generic problem in data mining is finding relationships between variables
  – Is purchasing item A likely to be related to purchasing item B?
• Variables are independent if there is no relationship; otherwise they are dependent
• Independent if p(x, y) = p(x) p(y); equivalently
  p(A|B) = p(A ∩ B) / p(B) = p(A) p(B) / p(B) = p(A)
  p(B|A) = p(A ∩ B) / p(A) = p(A) p(B) / p(A) = p(B)

Conditional independence: more than 2 variables
• X is conditionally independent of Y given Z if for all values of x, y, z we have
  p(x, y | z) = p(x | z) p(y | z)

Conditional independence: more than 2 variables
• Example (continuing the student-score example; F = fail, E = even score)
  – P(F) = 60/101, but P(F|E) = 30/51 ≠ P(F), so E and F are dependent
  – Let B be the event that the score is not 100; then
    • P(F|B) = 60/100
    • P(E|B) = 50/100 = 1/2
    • P(E ∩ F|B) = 30/100 = 60/100 × 1/2
    • Given B, E and F are independent

Conditional independence: more than 2 variables
• Example (continued)
  – Let C be the event that the score is 100; then
    • P(F|C) = 0
    • P(E|C) = 1
    • P(E ∩ F|C) = 0 = 1 × 0
    • Given C, E and F are independent
  – Now we can calculate P(E ∩ F):
    P(E ∩ F) = P(E ∩ F | score ≠ 100) P(score ≠ 100) + P(E ∩ F | score = 100) P(score = 100)
             = 0.5 × 0.6 × (100/101) + 0 = 30/101

Conditional independence
• Conditional independence does not imply marginal independence:
  p(x, y | z) = p(x | z) p(y | z) does not imply p(x, y) = p(x) p(y)
• Note that X and Y may be unconditionally independent but conditionally dependent given Z

On assuming independence
• Independence is a strong assumption, frequently violated in practice
• But it simplifies modeling
  – Fewer parameters
  – Understandable models
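The student-score example can be verified by brute force over the sample space. A small sketch, assuming scores 0–100 drawn uniformly, fail = score below 60, and even including 0, as in the slides:

```python
from fractions import Fraction

scores = set(range(101))                 # sample space S = {0, 1, ..., 100}
F = {s for s in scores if s < 60}        # fail: score below 60
E = {s for s in scores if s % 2 == 0}    # even score (0 counts as even)
B = {s for s in scores if s != 100}      # conditioning event: score != 100

def p(event, given=None):
    """P(event | given) under a uniform draw from the sample space."""
    space = scores if given is None else given
    return Fraction(len(event & space), len(space))

print(p(F))                              # 60/101
print(p(E & F) == p(E) * p(F))           # False: E and F are dependent
print(p(E & F, B) == p(E, B) * p(F, B))  # True: independent given B
```

The same check with C = {100} also succeeds trivially, since P(F|C) = 0.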
Dependence and correlation
• Covariance measures how X and Y vary together
  – Large and positive if large X is associated with large Y, and small X with small Y
  – Negative if large X is associated with small Y
• Two variables may be dependent but not linearly correlated

Correlation and causation
• Two variables may be highly correlated without any causal relationship between them
  – Yellow-stained fingers and lung cancer may be correlated, but they are causally linked only through a third variable: smoking
  – Human reaction time and earned income are negatively correlated
    • This does not mean that one causes the other
    • A third variable, age, is causally related to both

Samples and statistical inference
• Samples can be used to model the data
• If the goal is to detect small deviations in the data, the sample size will affect the result

Dual role of probability and statistics in data analysis

Outline
• Random variables
• Estimation
  – Maximum likelihood estimation
  – Bayesian estimation
• Sampling

Estimation
• In inference we want to make statements about the entire population from which the sample was drawn
• Two important methods for estimating the parameters of a model
  – Maximum likelihood estimation
  – Bayesian estimation

Desirable properties of estimators
• Let θ̂ be an estimate of parameter θ
• Two measures of estimator quality
  – Bias: the difference between the expected and true values
    Bias(θ̂) = E[θ̂] − θ
  – Variance of the estimate
    Var(θ̂) = E[(θ̂ − E[θ̂])²]

Mean squared error
• The mean of the squared difference between the value of the estimator and the true value of the parameter: E[(θ̂ − θ)²]
• The mean squared error can be partitioned as the sum of the squared bias and the variance:
  E[(θ̂ − θ)²] = Bias²(θ̂) + Var(θ̂)

Mean squared error
  E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²]
              = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)² + 2 E[(θ̂ − E[θ̂])(E[θ̂] − θ)]
              = Var(θ̂) + Bias²(θ̂) + 2 (E[θ̂] − θ) E[θ̂ − E[θ̂]]
              = Var(θ̂) + Bias²(θ̂)      since E[θ̂ − E[θ̂]] = 0
• This uses (a − b + b − c)² = (a − b)² + (b − c)² + 2(a − b)(b − c),
  together with E[E[X]] = E[X] and E[c] = c where c is a constant
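The decomposition E[(θ̂ − θ)²] = Bias²(θ̂) + Var(θ̂) can be checked numerically. The sketch below uses a deliberately biased "add-one" estimator of a Bernoulli parameter; the estimator, the true θ = 0.7, and the trial counts are all illustrative assumptions, not from the slides:

```python
import random

random.seed(1)
theta = 0.7           # true parameter (assumed for the experiment)
n, trials = 20, 20000

# (r + 1) / (n + 2): a biased estimator, chosen so the bias term is nonzero.
estimates = []
for _ in range(trials):
    r = sum(random.random() < theta for _ in range(n))
    estimates.append((r + 1) / (n + 2))

mean_est = sum(estimates) / trials
bias = mean_est - theta
var = sum((e - mean_est) ** 2 for e in estimates) / trials
mse = sum((e - theta) ** 2 for e in estimates) / trials

# The empirical MSE equals squared bias plus variance (up to float error).
print(mse, bias ** 2 + var)
```

Within the sample of estimates the identity holds exactly, because the cross term in the derivation above vanishes when the sample mean is used.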
Maximum likelihood estimation
• The most widely used method for parameter estimation
• The likelihood function is the probability that the data D would have arisen for a given value of θ:
  L(θ | D) = L(θ | x(1), …, x(n)) = p(x(1), …, x(n) | θ) = Π_{i=1..n} p(x(i) | θ)
• The value of θ for which the data have the highest probability is the MLE

Example of MLE for the binomial
• Customers either purchase or do not purchase milk
  – We want an estimate of the proportion purchasing
• Binomial with unknown parameter θ
• Samples x(1), …, x(1000), of which r purchase milk
• Assuming conditional independence, the likelihood function is
  L(θ | x(1), …, x(1000)) = Π_i θ^x(i) (1 − θ)^(1 − x(i)) = θ^r (1 − θ)^(1000 − r)

Log-likelihood function
• Maximizing is easier on the log scale, so switch to the log-likelihood function
  l(θ) = log L(θ) = r log θ + (1000 − r) log(1 − θ)
• Differentiate and set equal to zero:
  dl/dθ = r/θ − (1000 − r)/(1 − θ) = 0
  r(1 − θ) = (1000 − r) θ
  θ̂ = r / 1000

Example of MLE for the binomial
• r milk purchases out of n customers
• θ is the probability that milk is purchased by a random customer
• For 3 data sets
  – r = 7, n = 10
  – r = 70, n = 100
  – r = 700, n = 1000
• The uncertainty becomes smaller as n increases

Likelihood under the normal distribution
• Unit variance, unknown mean θ; density
  f(x) = (1/√(2π)) exp(−(x − θ)²/2)
• Likelihood function
  L(θ | x(1), …, x(n)) = Π_{i=1..n} (2π)^(−1/2) exp(−(x(i) − θ)²/2)
                       = (2π)^(−n/2) exp(−½ Σ_{i=1..n} (x(i) − θ)²)

Log-likelihood function
  l(θ | x(1), …, x(n)) = −(n/2) log 2π − ½ Σ_{i=1..n} (x(i) − θ)²
• To find the MLE, set the derivative dl/dθ to zero:
  dl/dθ = Σ_{i=1..n} (x(i) − θ) = 0
  Σ x(i) − nθ = 0
  θ̂ = (1/n) Σ_{i=1..n} x(i)

Likelihood under the normal distribution
• θ̂ is the estimated mean
• For 2 data sets (drawn at random)
  – 20 data points
  – 200 data points

Sufficient statistic
• A quantity s(D) is a sufficient statistic for θ if the likelihood l(θ) depends on the data only through s(D)
• No other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter
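The closed form θ̂ = r/n derived above can be confirmed numerically: evaluate the binomial log-likelihood on a grid of candidate values and find where it peaks. A minimal sketch using the r = 7, n = 10 data set:

```python
import math

r, n = 7, 10  # 7 milk purchases out of 10 customers

def log_likelihood(theta):
    """Binomial log-likelihood l(theta) = r log(theta) + (n - r) log(1 - theta)."""
    return r * math.log(theta) + (n - r) * math.log(1 - theta)

# Search a fine grid over (0, 1); the concave log-likelihood has one peak.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)  # 0.7, i.e. r/n
```

The grid search lands exactly on r/n = 0.7, matching the derivative-based derivation.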
Interval estimate
• A point estimate does not convey the uncertainty associated with it
• An interval estimate provides a confidence interval

Likelihood under the normal distribution
• Normal density:
  f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
• Likelihood function:
  L(μ, σ² | x1, …, xn) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σ_{i=1..n} (xi − μ)²)
• Log-likelihood function:
  l(μ, σ² | x1, …, xn) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1..n} (xi − μ)²

Mean
• Differentiate the log-likelihood with respect to μ and set to zero:
  dl/dμ = (1/σ²) Σ_{i=1..n} (xi − μ) = 0
  Σ xi − nμ = 0
  μ̂ = (1/n) Σ_{i=1..n} xi

Variance
• Differentiate with respect to σ² and set to zero:
  dl/dσ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1..n} (xi − μ)² = 0
  σ̂² = (1/n) Σ_{i=1..n} (xi − μ)²

Outline
• Random variables
• Estimation
  – Maximum likelihood estimation
  – Bayesian estimation
• Sampling

Bayesian approach
• Frequentist approach
  – The parameters of the population are fixed but unknown
  – The data are a random sample
  – The intrinsic variability lies in the data D = {x(1), …, x(n)}
• Bayesian approach
  – The data are known
  – The parameters θ are random variables
  – θ has a distribution of values
  – The prior p(θ) reflects our degree of belief about where the true parameter θ may lie

Bayesian estimation
• The prior is modified by Bayes' rule:
  p(θ | D) = p(D | θ) p(θ) / p(D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
• This leads to a distribution rather than a single value
  – A single value can be obtained from the mean or the mode

Bayesian estimation
• p(D) is a constant independent of θ
• For a given data set D and a particular model (model = distributions for the prior and the likelihood)
  p(θ | D) ∝ p(D | θ) p(θ)
• If we have only a weak belief about the parameter before collecting data, choose a wide prior (e.g., a normal with large variance)

Binomial example
• A single binary variable X: we wish to estimate θ = p(X = 1)
• A prior for a parameter in [0, 1] is the Beta distribution:
  p(θ) ∝ θ^(α − 1) (1 − θ)^(β − 1), where α > 0 and β > 0 are the two parameters of this model
  Beta(θ | α, β) = (Γ(α + β) / (Γ(α) Γ(β))) θ^(α − 1) (1 − θ)^(β − 1)
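Combining this Beta prior with binomial data yields another Beta distribution (the standard conjugate update, with parameters α + r and β + n − r). A sketch using the coin data r = 7, n = 10; the particular priors Beta(5, 5) and Beta(145, 145) are the ones plotted in the slides:

```python
def beta_posterior(alpha, beta, r, n):
    """Conjugate update: Beta(a, b) prior + r successes in n trials."""
    return alpha + r, beta + n - r

def beta_mean(alpha, beta):
    """Mean of a Beta(a, b) distribution."""
    return alpha / (alpha + beta)

r, n = 7, 10                             # coin data: 7 heads in 10 tosses
weak = beta_posterior(5, 5, r, n)        # weak prior Beta(5, 5)
strong = beta_posterior(145, 145, r, n)  # strong prior Beta(145, 145)

print(weak, beta_mean(*weak))      # (12, 8): posterior mean 0.6
print(strong, beta_mean(*strong))  # (152, 148): posterior mean ~0.507
```

With the weak prior the posterior mean (0.6) sits between the prior belief (0.5) and the MLE (0.7); with the strong prior the data barely move it from 0.5.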
Binomial example
• Likelihood function
  L(θ | D) = θ^r (1 − θ)^(n − r)
• Combining likelihood and prior:
  p(θ | D) ∝ p(D | θ) p(θ) = θ^r (1 − θ)^(n − r) θ^(α − 1) (1 − θ)^(β − 1)
           = θ^(α + r − 1) (1 − θ)^(β + n − r − 1)
• We get another Beta distribution
  – With parameters α + r and β + n − r

Beta(5, 5) and Beta(145, 145)

Beta(5, 5) and Beta(45, 50)

Advantages of the Bayesian approach
• Retains full knowledge of all problem uncertainty
• Calculates the full posterior distribution of θ
• Natural updating of the distribution:
  p(θ | D1, D2) ∝ p(D2 | θ) p(D1 | θ) p(θ)

Predictive distribution
• In the equation that modifies the prior into the posterior,
  p(θ | D) = p(D | θ) p(θ) / p(D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
  the denominator is called the predictive distribution of D
• Useful for model checking
  – If the observed data have only a small probability under the model, the model is unlikely to be correct

Normal distribution example
• Suppose x comes from a normal distribution with unknown mean θ and known variance α
• The prior distribution for θ is θ ~ N(θ0, α0)

Normal distribution example
  p(θ | x) ∝ p(x | θ) p(θ)
          = (1/√(2πα)) exp(−(x − θ)²/(2α)) × (1/√(2πα0)) exp(−(θ − θ0)²/(2α0))
• Let 1/α1 = 1/α0 + 1/α and θ1 = α1 (θ0/α0 + x/α); completing the square gives
  p(θ | x) = (1/√(2πα1)) exp(−(θ − θ1)²/(2α1))
• The posterior is again normal, with mean θ1 and variance α1

Jeffreys prior
• A reference prior
• Fisher information:
  I(θ | x) = −E[∂² log L(θ | x) / ∂θ²]
• Jeffreys prior:
  p(θ) ∝ √I(θ | x)

Conjugate priors
• p(θ) is a conjugate prior for p(x | θ) if the posterior distribution p(θ | x) is in the same family as the prior p(θ)
  – Beta prior → Beta posterior
  – Normal prior → Normal posterior

Outline
• Random variables
• Estimation
  – Maximum likelihood estimation
  – Bayesian estimation
• Sampling

Sampling in data mining
• The available data set may not be well suited to statistical analysis
  – "Experimental design" in statistics is concerned with optimal ways of collecting data
  – Data miners usually cannot control the data collection process
  – The data may be ideally suited to the purposes for which they were collected, but not adequate for data mining uses

Sampling in data mining
• Two ways in which samples arise
  – The database is a sample of the population
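The normal-prior update above can be sketched directly from the formulas 1/α1 = 1/α0 + 1/α and θ1 = α1(θ0/α0 + x/α). The numeric values below (prior N(0, 4), observation x = 2 with variance 1) are illustrative assumptions:

```python
def normal_posterior(theta0, alpha0, x, alpha):
    """Posterior mean/variance for one normal observation with known
    variance alpha and a normal prior N(theta0, alpha0)."""
    alpha1 = 1.0 / (1.0 / alpha0 + 1.0 / alpha)        # posterior variance
    theta1 = alpha1 * (theta0 / alpha0 + x / alpha)    # posterior mean
    return theta1, alpha1

theta1, alpha1 = normal_posterior(theta0=0.0, alpha0=4.0, x=2.0, alpha=1.0)
print(theta1, alpha1)  # 1.6 0.8
```

The posterior mean (1.6) is pulled toward the observation, weighted by the precisions, and the posterior variance (0.8) is smaller than both the prior variance and the observation variance, illustrating the normal-to-normal conjugacy.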
  – The database contains every case, but the analysis is based on a sample
    • Not appropriate when we want to find unusual records

Why sampling
• Draw a sample from the database that allows us to construct a model reflecting the structure of the data in the database
  – Efficiency: quicker and easier
• The sample must be representative of the entire database

Systematic sampling
• Tries to ensure representativeness
  – e.g., taking one out of every two records
• Can lead to problems when there are regularities in the database
  – e.g., a data set whose records are married couples

Random sampling
• Avoids such regularities
• Epsem sampling
  – Each record has the same probability of being chosen

Variance of the mean of a random sample
• If the variance of a population of size N is σ², the variance of the mean of a simple random sample of size n drawn without replacement is
  (σ²/n) (1 − n/N)
  – Usually N >> n, so the second factor is close to 1, and the variance decreases as the sample size increases

Example
• 2000 points, population mean = 0.0108
• Random samples of n = 10, 100, 1000, each repeated 200 times

Stratified random sampling
• Split the population into non-overlapping subpopulations, or strata
• Advantage
  – Enables making statements about each of the subpopulations separately
• For example, one of the credit card companies we work with categorizes transactions into 26 categories: supermarket, gas station, and so on

Mean of a stratified sample
• The total size of the population is N
• The k-th stratum has Nk elements in it
• nk of them are chosen for the sample from this stratum
• The sample mean within the k-th stratum is x̄k
• The estimate of the population mean is
  Σ_k (Nk / N) x̄k

Cluster sampling
• Every cluster contains many elements
• A simple random sample of elements is not appropriate
• Select clusters, not elements
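The stratified estimator Σ_k (Nk/N) x̄k can be sketched on synthetic data. The two strata, their sizes, their distributions, and the per-stratum sample size nk = 50 are all hypothetical choices for illustration:

```python
import random

random.seed(2)

# Hypothetical strata: name -> list of population values (N_k = len(list)).
strata = {
    "supermarket": [random.gauss(50, 10) for _ in range(600)],
    "gas station": [random.gauss(30, 5) for _ in range(400)],
}

N = sum(len(values) for values in strata.values())   # total population size

estimate = 0.0
for values in strata.values():
    sample = random.sample(values, 50)           # n_k = 50 from this stratum
    x_bar_k = sum(sample) / len(sample)          # stratum sample mean
    estimate += (len(values) / N) * x_bar_k      # weight by N_k / N

true_mean = sum(sum(values) for values in strata.values()) / N
print(estimate, true_mean)  # the estimate lands close to the true mean
```

Weighting each stratum mean by its population share Nk/N keeps the estimator unbiased even though the strata are sampled at different rates relative to their sizes.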