A Bayesian Statistical Approach to Modeling Gene Regulatory Pathways in Human Placental Data
Elinor Velasquez
Dept. of Biology
San Francisco State University

Outline of talk
• Introduction
• The experimental approach: Obtaining placenta data
• The experimental approach: Modeling gene regulatory networks
• Results from experiments
• Conclusions and future work
• Acknowledgements

Introduction

Overall goal
To use a bioinformatics model to better understand the human placenta
http://www.biotechnologycenter.org/hio/assets/hisimages/placenta/placenta44.jpg

The human placenta
http://www.uchsc.edu/winnlab/index.html

The basal plate in the placenta
Site of known anatomical abnormalities in preeclampsia
http://www.uchsc.edu/winnlab/projects.html

EGFR pathway
• EGFR, cell surface receptor for epidermal growth factors
• Potentially important gene for the placenta
British Journal of Cancer (2006) 94, 184 – 188

EGFR regulates gene expression
[Diagram: EGFR, ANGPT2, CSPG2, DCN]

Causal relationships
[Diagram: EGFR, ANGPT2, CSPG2, DCN]

Example of a gene regulatory network
[Diagram: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6]

Definition of a Bayesian network
• There exist nodes (disks)
• There are edges (arrows) between some of the nodes
• Causality is implied by the edges
• Acyclic
[Diagram: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6]

The experimental approach: Obtaining placenta data

Data collected from microarrays
• Data comes from 36 experiments conducted by Virginia Winn et al. at the SJ Fisher lab, UCSF
• Gene expression profiling experiments
[Diagram: cRNA hybridization; 45000 dots (25-mer oligo probe sets) representing the human genome]

Traditional “spotted” arrays

What is a probe set?
• Several oligonucleotides designed to hybridize to various parts of the mRNA generated from a single gene
[Diagram labels: Probe set, mRNA, gene]

Affymetrix GeneChips

Microarray data
The normalized log2 intensity values were centered to the median value of each probe set, by Virginia Winn et al.
[Diagram: a probe set across 5 time segments: 1: x1 ... x6; 2: y1 ... y9; 3: z1 ... z6; 4: w1 ... w6; 5: s1 ... s9]
36 data points per probe set

Microarray data
• Red denotes up-regulated expression and green denotes down-regulated expression relative to the median value
• Genes differentially expressed in the basal plate of placentas: Rows contain data from a single basal plate cRNA sample and columns correspond to a single probe set.
http://www.uchsc.edu/winnlab/index.html

Summary of data used in bioinformatics experiments
• 36 placentas
• 45,000 probe sets
• Time-series data from 14-16 weeks to term
[Plot: Gene egfr; y-axis: average gene expression value (7.8 to 9.8); x-axis: weeks (14 - 16, 18 - 19, 21, 23 - 24, 37 - 40)]

The experimental approach: Modeling gene regulatory networks

Outline of bioinformatics experimental design
[Diagram: PS 1, PS 2, PS 3, PS 4]
Step 1. Create a naïve Bayesian network using the probe set data
Step 2. Score the naïve Bayesian network
Step 3. Randomly add/delete an edge and rescore the Bayesian network
Step 4. Continue until best score reached
Step 5. Combine probe sets to create the gene regulatory network

Four probe sets (Three genes)

Define naïve Bayesian network
• Choose a root node
• All other nodes branch off of the root node
• PS1 is the parent node
[Diagram: PS 1, PS 2, PS 3, PS 4]

Step 1: Create a naïve Bayesian network using probe set data
[Diagram: PS1, PS2, PS3, PS4]
• Use data from one time segment
• Choose Weeks 23-24 data (6 placentas)
• Choose 4 probe sets

Placenta data for Weeks 23-24
PS1 corresponds to 201984, which corresponds to EGFR
PS2 corresponds to 236034 and PS3 corresponds to 211148; PS2 and PS3 both correspond to ANGPT2
PS4 corresponds to 204620, which corresponds to CSPG2

Step 2: Score the naïve Bayesian network
• We want to score this network:
[Diagram: PS1, PS2, PS4, PS3]

The network score is a function of conditional probabilities
• Conditional probability, Prob(N | pa(N)), is the probability of child node N given the parent of N
• Example: Given that a parent node, PS1, has an associated expression value of 10, what is the probability that its child node, PS4, has an expression value of 8?
[Diagram: PS1, PS4]

Conditional probability
• EGFR (PS1) is the parent node and has value 10.
• CSPG2 (PS4) is the child node and has value 8 two times
• Conditional probability = 2/6
[Diagram: PS1, PS4]

Score for a Bayesian network
The score of the naïve network equals the product of all the nonzero conditional probabilities associated with the network:
$$P(N_1, N_2, N_3, N_4) = \prod_{i=1}^{4} P(N_i \mid pa(N_i))$$

Score for the naïve Bayesian network
$$P(N_1, N_2, N_3, N_4) = 1/39366 = 2.54 \times 10^{-5}$$
[Diagram: PS1, PS2, PS4, PS3]

Step 3: Randomly add/delete an edge and rescore the Bayesian network
[Diagram: PS1, PS2, PS4, PS3]
The score becomes $1/78732 = 1.27 \times 10^{-5}$.

Step 4. Continue until best score reached
• Since the score is a probability, we want the score to be high.
• The naïve network is the better choice between the two networks, so we pick it as our final network.
[Diagram: PS1, PS2, PS4, PS3]

Step 5. Combine probe sets to create the gene regulatory network
[Diagram: EGFR, ANGPT2, CSPG2]

40 probe sets (26 genes)

Gene regulatory pathway for 26 genes
Step 1. Create a naïve Bayesian network using 40 probe sets for each time segment
Step 2. Score the naïve Bayesian network
Step 3. Randomly add/delete an edge and rescore the Bayesian network
Step 4. Continue until best score reached
Step 5. Combine probe sets to create the gene regulatory network for the placenta

Step 1. Create a naïve Bayesian network using 40 probe sets for each time segment
Create a naïve Bayesian network
[Diagram: PS 1, PS 2, PS 3, PS 4, PS 5, PS 6, PS 7, PS 8, PS 9]

Step 2. Score the naïve Bayesian network
Score for a Bayesian network
The score of the naïve network equals the product of all the nonzero conditional probabilities associated with the network:
$$P(N_1, \ldots, N_{40}) = \prod_{i=1}^{40} P(N_i \mid pa(N_i))$$

Step 3. Randomly add/delete an edge and rescore the Bayesian network

Step 4. Continue until best score reached
With four probe sets, at least two Bayesian networks were constructed:
[Diagram: two networks, each over PS1, PS2, PS3, PS4]

Exhaustive search
• To be certain that we have the best-scoring network, we need to construct all possible networks from our naïve networks
• With four probe sets, we constructed only one network other than the naïve network
• How to construct all possible networks?

How do we construct all possible networks?
• 1 probe set: 1 Bayesian network
• 2 probe sets: 2 possible Bayesian networks
• 3 probe sets: 12 possible Bayesian networks
• 4 probe sets: 144 possible Bayesian networks
• 5 probe sets: > 4800 possible Bayesian networks!
• 6 probe sets: … ??
• And so on…

Welcome to “Modern Heuristics”
• Step 1. Representation of a model
• Step 2. The scoring function
• Step 3. Defining the search problem
• Step 4. Consider local optima
[Plot: score against local change]

Step 1: Representation of the model
• The model is a gene regulatory pathway.
• We are going to assume a Bayesian model for our probe sets:
[Diagram: PS 1, PS 2, PS 3, PS 4]
• The number of possible pathways is so large as to forbid an exhaustive search for the best Bayesian network.

Step 2: The scoring function
• The fair coin, p(X = heads) = ½
• What happens if the coin is unfairly weighted?
• We need to re-think probability:
$$p(X) = \int p(x)\, r(x)\, dx$$
• r(x) is a weight function.

Step 2. The scoring function
• The scoring function is a probability
• Assume the network has a Dirichlet distribution, which is the weight function used to weight the conditional probabilities.
www.wikipedia.com

Step 2. The scoring function
Probability of a fixed network equals the product of conditional probabilities times the Dirichlet distribution:
$$P(N) = \prod_{i=1}^{40} P(N_i \mid pa(N_i))\, D(N_i)$$
such that
$$D(N_i) = \prod \theta_i^{\alpha_i - 1}(N_i)$$

Step 3: Defining the search problem
What it means to search:
a. Construct a first network (use a naïve Bayesian network)
b. Score the first network using the scoring function
c. Perform the hill-climbing algorithm.

Step 3. Defining the search problem
The hill-climbing algorithm:
• Randomly choose a node
• “Search” in the neighborhood of that node for the best-scoring network

Step 4. Consider local optima
• Hill climbing is a traditional search technique
• Can get caught on local maxima
• Step 4 is to keep choosing random nodes.
[Plot: score against local change; the randomly chosen node is the origin. From http://content.answers.com/]

Software
• Weka software package written by members of the University of Waikato, New Zealand, http://www.cs.waikato.ac.nz/~ml/people.html
• DEAL, R package, written by Susanne G. Bøttcher, Claus Dethlefsen, http://www.math.auc.dk/novo/deal
• BayesNet Toolbox, Matlab package, written by Kevin Murphy, http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
• ExpressionNet, written by Jingchun Zhu, http://expressionnet.sourceforge.net/

Results from experiments

26 genes
COL5A1 COL3A1 COL5A2 DCN COL1A2 CSPG2 SPP1 INHBA ANGPT2 BAMBI SFRP1 P4HA1 IGFBP1 RAP2B PLAU CCNG2 MRC2 GLB1 ATP5E ADAM9 EGFR ERG USP6NL PECAM1 IL2RB CECAM1 CYP19A1
Ingenuity network

Results for 26 genes
• 40 probe sets (26 genes)
• Data comes from five different time intervals:
1. 14 – 16 gestational weeks
2. 18 – 19 gestational weeks
3. 21 gestational weeks
4. 23 – 24 gestational weeks
5. 37 – 40 gestational weeks

[Network diagram over the genes listed above] Time segment: 14 – 16 weeks
[Network diagram over the genes listed above] Time segment: 18 – 19 weeks
[Network diagram over the genes listed above] Time segment: 21 weeks
[Network diagram over the genes listed above] Time segment: 23 – 24 weeks
[Network diagram over the genes listed above] Time segment: 37 – 40 weeks

How to display data
• One of the most pressing questions in bioinformatics research is how to display the data effectively
• We have two solutions
1. An interaction map
2. Geometrical considerations

An interaction map for 26 genes

Geometrical considerations
• Will illustrate with the gene egfr
• egfr encodes the epidermal growth factor receptor
  Functions on the cell surface
  Activated by binding of its specific ligands
  Responsible for many pathways in animal models

Gene egfr regulated by:
Genes on a dodecahedron: Gene regulatory network for egfr
CSPG2 CCNG2 COL1A2 PLAU INHBA
On backside: PECAM1 ANGPT2 IGFBP1 MRC2 SPP1 USP6NL DCN
Adapted from http://www.math.cornell.edu/~mec/2003-2004/geometry/platonic/dodecahedron.jpg

Conclusions
• We can predict gene regulatory networks using Bayesian networks as an intermediate step
• When we leave arrows in the network, we are able to show causal relationships between the genes
• Interaction maps and use of geometry are novel ways to display gene behavior

Future Directions
• A three-dimensional viewer with numerical values will be implemented to use with the Weka software
• Use molecular genetics techniques to validate a portion of the results
• Design a genetic programming algorithm (evolutionary algorithm) to create a Bayesian network

Acknowledgements
San Francisco State University: Leticia Márquez-Magaña, Chris Smith, Frank Bayliss, Juan Castellon, Ernesto Flores, Rebecca Garcia, Alba Gutierrez, Jainee Lewis, Rebecca Mendez, Cylyn Cruz, Jasmin Reyes, Jackie Robinson, Peter Thorsen
My family
UC San Francisco: Susan Fisher, Matthew Gormley
M.B.R.S.-R.I.S.E. Grant 5 - R25-GM59298
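
Sketch of the search procedure
The steps described in the talk (start from a naïve network, score it, add or delete an edge, rescore, continue until no change helps) can be sketched in Python. This is a minimal illustration under stated assumptions, not the Weka procedure used for the results. The function names (log_score, is_acyclic, hill_climb), the toy expression values, and the pseudo-count alpha are all hypothetical. The score is a log-likelihood smoothed with a symmetric Dirichlet pseudo-count, and the search sweeps over every possible edge in turn instead of picking edges at random.

```python
import math
from collections import Counter
from itertools import permutations


def log_score(data, parents, alpha=1.0):
    """Smoothed log-likelihood: sum over samples and nodes of log P(N_i | pa(N_i)).

    alpha is a symmetric Dirichlet pseudo-count (an assumption of this sketch).
    """
    total = 0.0
    for child in parents:
        pa = sorted(parents[child])
        levels = len({row[child] for row in data})  # distinct values of the child
        joint = Counter((tuple(row[p] for p in pa), row[child]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        for row in data:
            key = tuple(row[p] for p in pa)
            total += math.log((joint[(key, row[child])] + alpha)
                              / (marginal[key] + alpha * levels))
    return total


def is_acyclic(parents):
    """Depth-first search along parent links; False if a directed cycle exists."""
    state = {}  # 1 = on the current search path, 2 = finished

    def visit(node):
        if state.get(node) == 1:
            return False
        if state.get(node) == 2:
            return True
        state[node] = 1
        ok = all(visit(p) for p in parents[node])
        state[node] = 2
        return ok

    return all(visit(node) for node in parents)


def hill_climb(data, parents):
    """Toggle one edge at a time; keep a change only if the score improves."""
    parents = {n: set(p) for n, p in parents.items()}
    best = log_score(data, parents)
    improved = True
    while improved:
        improved = False
        for child, parent in permutations(sorted(parents), 2):
            trial = {n: set(p) for n, p in parents.items()}
            trial[child] ^= {parent}  # add the edge parent -> child, or delete it
            if not is_acyclic(trial):
                continue
            score = log_score(data, trial)
            if score > best + 1e-9:
                parents, best, improved = trial, score, True
    return parents, best


# Hypothetical discretized expression values: 6 samples x 4 probe sets.
# Column 0 plays the role of PS1, column 3 the role of PS4.
data = [(10, 8, 8, 8), (10, 8, 8, 9), (10, 9, 9, 8),
        (9, 9, 9, 9), (9, 8, 8, 9), (9, 9, 9, 8)]
naive = {0: set(), 1: {0}, 2: {0}, 3: {0}}  # the root node is parent of all others
naive_score = log_score(data, naive)
learned, learned_score = hill_climb(data, naive)
```

Working with logarithms avoids the very small products that arise when many conditional probabilities are multiplied, and the acyclicity check keeps every candidate a valid Bayesian network. Like any hill-climbing search, this sketch can stop at a local maximum.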