Aspects of Bayesian Inference and
Statistical Disclosure Control in Python

Duncan Smith
Confidentiality and Privacy Group
CCSR
University of Manchester
Introduction

* Bayesian Belief Networks (BBNs)
  - probabilistic inference

* Statistical Disclosure Control (SDC)
  - deterministic inference (attribution)
Bayesian Belief Networks

* Decision-making in complex domains
* Hard and soft evidence
* Correlated variables
* Many variables
Bayes’ Rule

P(A|B) = P(B|A) P(A) / P(B)

A prior belief and evidence combined to give a posterior belief
Venn Diagram
[Venn diagram of events A and B: regions 'A only', 'B only', 'Both A & B', and 'Neither A nor B']
Inference
1. Prior probability table P(A)

          a     ā
         0.7   0.3

2. Conditional probability table P(B|A)

                a     ā
          b    3/7   2/3
          b̄    4/7   1/3
         sum    1     1

3. Produce joint probability table by multiplication

                a     ā    sum
          b    0.3   0.2   0.5
          b̄    0.4   0.1   0.5
         sum   0.7   0.3    1

4. Condition on the evidence B = b

                a     ā    sum
          b    0.3   0.2   0.5

5. Normalise table probabilities to sum to 1

                a     ā
          b    0.6   0.4
def Bayes(prior, conditional, obs_level):
    """Simple Bayes for two categorical variables. 'prior' is a
    Python list. 'conditional' is a list of lists ('row' variable
    conditional on 'column' variable). 'obs_level' is the index of
    the observed level of the row variable."""
    levels = len(prior)
    # condition on the observed level
    result = conditional[obs_level]
    # multiply values by prior probabilities
    result = [result[i] * prior[i] for i in range(levels)]
    # get the marginal probability of the observed level
    marg_prob = sum(result)
    # normalise the current values to sum to 1
    posterior = [value / marg_prob for value in result]
    return posterior
Note: conditioning can be carried out before calculating the joint
probabilities, reducing the cost of inference
>>> A = [0.7, 0.3]
>>> B_given_A = [[3.0/7, 2.0/3], [4.0/7, 1.0/3]]
>>> Bayes(A, B_given_A, 0)
[0.59999999999999998, 0.39999999999999997]
>>>

* The posterior distribution can be used as a new prior and combined
  with evidence from further observed variables, as sketched below

* Although computationally efficient, this 'naïve' approach implies
  assumptions that can lead to problems
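For instance, the Bayes function above can be chained, reusing each
posterior as the next prior. A minimal sketch, assuming a second,
hypothetical table C_given_A for a variable C also conditional on A:

>>> C_given_A = [[0.5, 0.2], [0.5, 0.8]]   # hypothetical P(C|A), indexed [c][a]
>>> posterior_A = Bayes(A, B_given_A, 0)   # [0.6, 0.4], as above
>>> Bayes(posterior_A, C_given_A, 1)       # evidence C at level 1; ~[0.484, 0.516]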
Naive Bayes

P(A,B,C) = P(C|A) P(B|A) P(A)

[Graph: arcs from A to B and from A to C]

A 'correct' factorisation

P(A,B,C) = P(C|A,B) P(B|A) P(A)

[Graph: arcs from A to B, from A to C, and from B to C]
Conditional independence

* The Naive Bayes example assumes:

  P(C|A,B) = P(C|A)

* But if valid, the calculation is easier and fewer probabilities
  need to be specified

The conditional independence implies that if A is observed, then
evidence on B is irrelevant in calculating the posterior of C
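A minimal sketch of this point, with hypothetical numbers for P(C|A):
under the naive factorisation the posterior of C given A = a is the
same whichever level of B is observed.

P_A = [0.7, 0.3]
P_B_given_A = [[3.0/7, 2.0/3], [4.0/7, 1.0/3]]   # indexed [b][a], as before
P_C_given_A = [[0.5, 0.2], [0.5, 0.8]]           # indexed [c][a], hypothetical

# joint P(a, b, c) under the naive factorisation
joint = [[[P_C_given_A[c][a] * P_B_given_A[b][a] * P_A[a]
           for c in range(2)] for b in range(2)] for a in range(2)]

# once A is observed, conditioning on B leaves the posterior of C
# unchanged: P(C|a, b) == P(C|a) for every level b
for a in range(2):
    for b in range(2):
        total = sum(joint[a][b])
        for c in range(2):
            assert abs(joint[a][b][c] / total - P_C_given_A[c][a]) < 1e-12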
A Bayesian Belief Network

[Graph: arcs R -> W, R -> H, S -> H]

R and S are independent until H is observed
A Markov Graph

[Graph: undirected edges R-W, R-H, S-H, and R-S]

The conditional independence structure is found by marrying parents
with common children
Factoring

* The following factorisation is implied

  P(H,R,S,W) = P(W|R) P(R) P(S) P(H|R,S)

* So P(S) can be calculated as follows (although there is little
  point, yet)

  P(S) = P(S) Σ_R P(R) Σ_H P(H|R,S) Σ_W P(W|R)

* If H and W are observed to be in states h and w, then the
  posterior of S can be expressed as follows (where epsilon denotes
  'the evidence')

  P(S|ε) ∝ P(S) Σ_R P(R) P(H=h|R,S) P(W=w|R)
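A minimal sketch of this last expression in plain Python, with
hypothetical numbers for the four tables (level 1 meaning 'true'):

P_R = [0.8, 0.2]                     # P(R)
P_S = [0.9, 0.1]                     # P(S)
P_W_given_R = [[0.8, 0.0],           # P(W=w | R=r), indexed [w][r]
               [0.2, 1.0]]
P_H_given_RS = [[[0.9, 0.1],         # P(H=h | R=r, S=s), indexed [h][r][s]
                 [0.0, 0.0]],
                [[0.1, 0.9],
                 [1.0, 1.0]]]

h, w = 1, 1                          # the observed evidence
unnorm = [P_S[s] * sum(P_R[r] * P_H_given_RS[h][r][s] * P_W_given_R[w][r]
                       for r in range(2))
          for s in range(2)]
posterior_S = [x / sum(unnorm) for x in unnorm]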
Graph Triangulation

[Three panels showing a five-node graph (A, B, C, D, E) being
triangulated by successively adding fill-in edges]
Belief Propagation

[Diagram: cliques (R,H,S) and (W,R) linked by separator R]
Message passing in a Clique Tree

[Diagram: nodes S, (R,S), (R,H,S) and (W,R)]
Message passing in a Directed Junction Tree
A Typical BBN

[Figure: a larger example belief network]
Belief Network Summary

* Inference requires a decomposable graph

* Efficient inference requires a good decomposition

* Inference involves evidence instantiation, table combination and
  variable marginalisation
Statistical Disclosure Control

* Releases of small area population (census) data

* Attribution occurs when a data intruder can make inferences (with
  probability 1) about a member of the population
                      Department
Profession         A     B     C   Row sum
Lawyer            18     4     2     24
Accountant         2     3     0      5
Col sum           20     7     2     29
Negative Attribution - An individual who is an accountant does not
work for Department C

Positive Attribution - An individual who works in Department C is a
lawyer

* Release of the full table is not safe from an attribute disclosure
  perspective (it contains a zero)

* Each of the two marginal tables is safe (neither contains a zero)

* Is the release of the two marginal tables 'jointly' safe?
The Bounds Problem

* Given a set of released tables (relating to the same population),
  what inferences about the counts in the 'full' table can be made?

* Can a data intruder derive an upper bound of zero for any cell
  count?
A non-graphical case

[Graph: triangle on nodes A, B, C]

* All 2 × 2 marginals of a 2 × 2 × 2 table

* A maximal complete subgraph (clique) with no corresponding
  individual table
Marginal table for Var1 (levels A, B) and Var2 (levels C, D):

          A    B
    C     3    9
    D     2    2

Marginal table for Var1 and Var3 (levels E, F):

          A    B
    E     1   10
    F     4    1

Marginal table for Var2 and Var3:

          C    D
    E     8    3
    F     4    1

Full table, Var3 by (Var1, Var2):

        A,C  A,D  B,C  B,D
    E    0    1    8    2
    F    3    1    1    0

* Original cell counts can be recovered from the marginal tables, as
  the sketch below demonstrates
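A minimal sketch verifying this for the tables above: with the three
2 × 2 margins fixed, the 2 × 2 × 2 table has a single free cell
x = n(A,C,E), and non-negativity forces x = 0, recovering the
original counts exactly. The dictionary layout is illustrative only.

m12 = {('C','A'): 3, ('C','B'): 9, ('D','A'): 2, ('D','B'): 2}   # Var2 x Var1
m13 = {('E','A'): 1, ('E','B'): 10, ('F','A'): 4, ('F','B'): 1}  # Var3 x Var1
m23 = {('E','C'): 8, ('E','D'): 3, ('F','C'): 4, ('F','D'): 1}   # Var3 x Var2

solutions = []
for x in range(min(m12[('C','A')], m13[('E','A')], m23[('E','C')]) + 1):
    # every other cell is determined by x and the margins
    n = {('A','C','E'): x,
         ('A','C','F'): m12[('C','A')] - x,
         ('A','D','E'): m13[('E','A')] - x,
         ('B','C','E'): m23[('E','C')] - x}
    n[('A','D','F')] = m12[('D','A')] - n[('A','D','E')]
    n[('B','C','F')] = m12[('C','B')] - n[('B','C','E')]
    n[('B','D','E')] = m13[('E','B')] - n[('B','C','E')]
    n[('B','D','F')] = m12[('D','B')] - n[('B','D','E')]
    if all(v >= 0 for v in n.values()):
        solutions.append(n)

assert len(solutions) == 1   # the original table is recovered exactly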
Table of upper bounds:

                      Department
Profession         A     B     C   Row sum
Lawyer            20     7     2     24
Accountant         5     5     2      5
Col sum           20     7     2     29

Each cell's upper bound is the minimum of its relevant margins
(Dobra and Fienberg)
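A minimal sketch of that rule for the table above, where the relevant
margins are just the row and column totals:

row_sums = [24, 5]              # Lawyer, Accountant
col_sums = [20, 7, 2]           # Departments A, B, C

upper = [[min(r, c) for c in col_sums] for r in row_sums]
# upper == [[20, 7, 2], [5, 5, 2]], matching the bounds table above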
SDC Summary

* A set of released tables relating to a given population

* If the set of released tables is graphical and the resulting graph
  is decomposable, then the upper bounds can be derived efficiently
Common aspects

* Graphical representations
  graphs / cliques / nodes / trees

* Combination of tables
  pointwise operations

* BBNs
  pointwise multiplication

* SDC
  pointwise minimum, and
  pointwise addition     }  for calculating
  pointwise subtraction  }  exact lower bounds
Coercing Numeric built-ins

* A table is a numeric array with an associated list of variables

* Marginalisation is trivial, using the built-in
  Numeric.add.reduce() function and removing the relevant variable
  from the list

* Conditioning is easily achieved using a Numeric.take() slice,
  appropriately reshaping the array with Numeric.reshape() and
  removing the variable from the list (both are sketched below)
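A minimal sketch of both operations, using the modern NumPy
equivalents of the Numeric calls named above (Numeric itself is long
obsolete); pairing each array with a variable list is an assumption
about the described design:

import numpy

table = numpy.array([[18, 4, 2],
                     [ 2, 3, 0]])
variables = ['profession', 'department']

# marginalise out 'profession' (axis 0) and drop it from the list
dept = numpy.add.reduce(table, axis=0)       # array([20, 7, 2])
dept_vars = [v for v in variables if v != 'profession']

# condition on department == 'A': take the slice, reshape, drop the variable
cond = numpy.take(table, [0], axis=1)        # array([[18], [ 2]])
cond = numpy.reshape(cond, (2,))             # array([18,  2])
cond_vars = [v for v in variables if v != 'department']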

Pointwise multiplication

* Numeric.multiply() generates the appropriate table IF the two
  tables have identical ranks and variable lists

* This is ensured by adding new axes (Numeric.NewAxis) for the
  'missing' axes and transposing one of the tables
  (Numeric.transpose()) so that the variable lists match
array([24, 5])   ['profession']   (2,)

array([20, 7, 2])   ['department']   (3,)

(after preprocessing)

array([[24],
       [ 5]])   ['profession', 'department']   (2, 1)

array([[20, 7, 2]])   ['profession', 'department']   (1, 3)

>>> prof * dept
array([[480, 168,  48],
       [100,  35,  10]])
['profession', 'department']

>>> (prof * dept).normalise(29)
array([[ 16.551,  5.793,  1.655],
       [  3.448,  1.206,  0.344]])
['profession', 'department']
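A minimal sketch of the preprocessing shown above, again with NumPy
standing in for Numeric (numpy.newaxis playing the role of
Numeric.NewAxis):

import numpy

prof = numpy.array([24, 5])          # ['profession']
dept = numpy.array([20, 7, 2])       # ['department']

# insert a length-1 axis for each 'missing' variable so both
# variable lists become ['profession', 'department']
prof2 = prof[:, numpy.newaxis]       # shape (2, 1)
dept2 = dept[numpy.newaxis, :]       # shape (1, 3)

product = prof2 * dept2              # shape (2, 3), as in the transcript
normalised = product * 29.0 / product.sum()   # scale to the population total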

Pointwise minimum / addition / subtraction

* Numeric.minimum(), Numeric.add() and Numeric.subtract() generate
  the appropriate tables IF the two tables have identical ranks and
  variable lists AND the two tables also have identical shape

* This is ensured by a secondary preprocessing stage where the
  tables from the first preprocessing stage are multiplied by a
  'correctly' shaped table of ones (this is actually quicker than
  using Numeric.concatenate())
array([[24],
       [ 5]])   ['profession', 'department']   (2, 1)

array([[20, 7, 2]])   ['profession', 'department']   (1, 3)

(2nd stage preprocessing)

array([[20, 7, 2],
       [20, 7, 2]])   (2, 3)

>>> prof.minimum(dept)
array([[20, 7, 2],
       [ 5, 5, 2]])
['profession', 'department']
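A minimal sketch of that second preprocessing stage, again in NumPy
terms: multiplying by a correctly shaped table of ones expands both
tables to a common shape before the pointwise minimum is taken:

import numpy

prof2 = numpy.array([[24], [5]])     # (2, 1), ['profession', 'department']
dept2 = numpy.array([[20, 7, 2]])    # (1, 3), ['profession', 'department']

ones = numpy.ones((2, 3), dtype=int)
bounds = numpy.minimum(prof2 * ones, dept2 * ones)
# array([[20,  7,  2],
#        [ 5,  5,  2]])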
Summary

* The Bayesian Belief Network software was originally implemented
  in Python for two reasons:
  1. The author was, at the time, a relatively inexperienced
     programmer
  2. Self-learning (albeit with some help) was the only option

* The SDC software was implemented in Python because:
  1. Python + Numeric turned out to be a wholly appropriate
     solution for BBNs (Python is powerful, Numeric is fast)
  2. Existing code could be reused