* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Linking recombinant genes sequence to protein
Copy-number variation wikipedia , lookup
Genetic code wikipedia , lookup
Transposable element wikipedia , lookup
Pathogenomics wikipedia , lookup
Public health genomics wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
Ridge (biology) wikipedia , lookup
Genomic imprinting wikipedia , lookup
Protein moonlighting wikipedia , lookup
Neuronal ceroid lipofuscinosis wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Genetic engineering wikipedia , lookup
Gene therapy wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
History of genetic engineering wikipedia , lookup
Minimal genome wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Gene therapy of the human retina wikipedia , lookup
Genome evolution wikipedia , lookup
Gene desert wikipedia , lookup
Genome editing wikipedia , lookup
Gene expression programming wikipedia , lookup
Point mutation wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Gene nomenclature wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Genome (book) wikipedia , lookup
Helitron (biology) wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Microevolution wikipedia , lookup
Gene expression profiling wikipedia , lookup
Linking recombinant genes sequence to protein products Javier González Join work with Neil D. Lawrence, David James, Joseph Longworth 23, February 2015 The big picture “Civilization advances by extending the number of important operations which we can perform without thinking of them.” (Alfred North Whitehead) Project background I Use cells as factories to produce compounds of drugs. Research agenda I Optimization of proteins production using hamster cells. Minimize the number of lab experiments! I Development of new Machine Learning techniques. Motivation I Synthetic biology: re-design of elements of living systems for useful purposes I Industrial interest: design of synthetic genes to engineer living cells to produce compounds of interest. I Optimization goal: rewrite gene sequences to optimize protein production (increase transcription and translation rates). The problem in a nutshell Proteins: Fundamental bricks of the cell. Genes: Information to synthesize proteins (1 protein → 1 gene) Gene sequence ATGCTGCAGATGTGGGGGTTTGTTCTCTATCTCTTCCTGAC TTTGTTCTCTATCTCTTCCTGACTTTGTTCTCTATCTCTTC... Considerations I Different gene sequences → same protein. I The sequence affects the synthesis efficiency. Which is the most efficient sequence to produce a protein? Redundancy of the genetic code I Codon: Three consecutive bases: AAT, ACG, etc. I Protein: sequence of amino acids. I Different codons may encode the same aminoacid. I ACA=ACU encodes for Threonine. ATUUUG ACA = ATUUUG ACU synonyms sequences. → same protein but different efficiency ‘Complexity’ of the genetic code How to design a synthetic gene? Very complex cell behavior! Gene sequence (features) → protein production efficiency Bayesian Optimization (BO) do: 1. Model as an emulator of the cell behavior. 2. Obtain a set of gene design rules. 3. Design a new gene coherent with the design rules. 4. Repeat experiment (get new data). until the gene is optimized (or the budget is over...) 1. Model as an emulator of the cell behavior Model inputs: I Set of gene sequences from which we extract some biologically relevant features: codon frequency, cai, gene length, etc. Model outputs: I Transcription and translation rates. Model type: I Multi-output Gaussian process model. 2. Obtain a set of gene design rules The model is used to determine which are the best gene design rules x ? . 3. Design a new gene coherent with the desing rules 1. Simulate a set of genes coherent with the one we want to design. 2. Extract the features from all the simulated genes. 3. Rank the generated genes according to their similarity to the ‘optimal’ design rules. 4. Pick up the best one and test it in the lab. Experiments I Optimization gene designs in mammalian cells. I Dataset in Schwanhausser et al. (2011) for 3810 genes rates. Associated sequences were extracted from http://www.ensembl.org. I Features: frequency of appearance the 64 codons, length of the gene, GC-content, AT-content, GC-ratio and AT-ratio. I Selection of 10 random difficult-to-express genes (average log ratio < 1.5). I 1,000 random ‘synonyms sequences’ generated from each gene. Prediction of the translation rates (in a test sample of 1500 genes) correlation ≈ 0.6 Results for 10 low-expressed genes Conclusions and future work I Bayesian optimization is a promising technique to design synthetic genes. Reduces the number experiments!! I Important aspects of the problem → to have a good surrogate model for the cell behavior. I More features → better gene design → development of scalable Bayesian optimization methods able to work well in high dimensions. I Alternative approach: focus on the optimization of the sequences. Combinatorial problem.