Linking recombinant gene sequences to
protein products
Javier González
Joint work with Neil D. Lawrence, David James, Joseph
Longworth
23 February 2015
The big picture
“Civilization advances by extending the
number of important operations which
we can perform without thinking of them.”
(Alfred North Whitehead)
Project background
- Use cells as factories to produce drug compounds.
Research agenda
- Optimization of protein production using hamster cells.
  Minimize the number of lab experiments!
- Development of new Machine Learning techniques.
Motivation
- Synthetic biology: redesign of elements of living systems for
  useful purposes.
- Industrial interest: design of synthetic genes to engineer living
  cells to produce compounds of interest.
- Optimization goal: rewrite gene sequences to optimize protein
  production (increase transcription and translation rates).
The problem in a nutshell
Proteins: fundamental building blocks of the cell.
Genes: information needed to synthesize proteins (1 protein → 1 gene).
Gene sequence
ATGCTGCAGATGTGGGGGTTTGTTCTCTATCTCTTCCTGAC
TTTGTTCTCTATCTCTTCCTGACTTTGTTCTCTATCTCTTC...
Considerations
- Different gene sequences → same protein.
- The sequence affects the synthesis efficiency.
Which sequence produces a given protein most efficiently?
Redundancy of the genetic code
- Codon: three consecutive bases: AAT, ACG, etc.
- Protein: a sequence of amino acids.
- Different codons may encode the same amino acid.
- ACA and ACU both encode threonine.
ATUUUG ACA = ATUUUG ACU
Synonymous sequences → same protein but different efficiency.
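As a minimal illustration of this redundancy, the sketch below translates two RNA sequences that differ only in synonymous codons. The two sequences are hypothetical examples (not taken from the slides), and the codon table is a small excerpt of the standard genetic code, just enough for this example.

```python
# Excerpt of the standard genetic code (RNA codons); only the entries used here.
CODON_TABLE = {
    "AUG": "Met",
    "UUG": "Leu", "CUG": "Leu",
    "ACA": "Thr", "ACU": "Thr", "ACC": "Thr", "ACG": "Thr",
}

def translate(rna):
    """Translate an RNA string codon by codon using CODON_TABLE."""
    if len(rna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
    return [CODON_TABLE[c] for c in codons]

seq_a = "AUGUUGACA"  # hypothetical example sequence
seq_b = "AUGCUGACU"  # synonymous variant: UUG -> CUG, ACA -> ACU

print(translate(seq_a))
print(translate(seq_b))
print(translate(seq_a) == translate(seq_b))
```

Both sequences yield the same amino acid chain, so any difference in production efficiency has to come from the choice of codons.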
‘Complexity’ of the genetic code
How to design a synthetic gene?
Very complex cell behavior!
Gene sequence (features) → protein production efficiency
Bayesian Optimization (BO)
do:
1. Model as an emulator of the cell behavior.
2. Obtain a set of gene design rules.
3. Design a new gene consistent with the design rules.
4. Run the experiment (get new data).
until the gene is optimized (or the budget runs out).
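The loop above can be sketched on a toy one-dimensional problem. This is only an illustration under stated assumptions: `toy_experiment` is a hypothetical stand-in for a lab measurement, the surrogate is a plain single-output Gaussian process with a fixed RBF kernel, and the upper confidence bound (UCB) is used as the acquisition rule. The slides do not say which acquisition function was used.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    solve = np.linalg.solve(k_tt, k_ts)        # K^{-1} K_*
    mean = solve.T @ y_train
    var = 1.0 - np.sum(k_ts * solve, axis=0)   # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))

def toy_experiment(x):
    """Hypothetical stand-in for a lab measurement (not real data)."""
    return np.sin(6.0 * x) * x

def bayes_opt(budget=10, beta=2.0, seed=0):
    """Run a small UCB-based Bayesian optimization loop on a grid."""
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.0, 1.0, 201)
    x = rng.choice(candidates, size=3, replace=False)  # initial designs
    y = toy_experiment(x)
    for _ in range(budget):                            # until the budget runs out
        mean, std = gp_posterior(x, y, candidates)     # 1. emulate the system
        ucb = mean + beta * std                        # 2. score candidate designs
        x_next = candidates[np.argmax(ucb)]            # 3. choose the next design
        x = np.append(x, x_next)                       # 4. run the experiment
        y = np.append(y, toy_experiment(x_next))
    best = np.argmax(y)
    return x[best], y[best]

x_best, y_best = bayes_opt()
print(x_best, y_best)
```

In the gene design setting, the scalar `x` would be replaced by a feature vector and each call to `toy_experiment` by an actual lab experiment, which is why the budget is kept small.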
1. Model as an emulator of the cell behavior
Model inputs:
- Biologically relevant features extracted from a set of gene
  sequences: codon frequency, codon adaptation index (CAI), gene length, etc.
Model outputs:
- Transcription and translation rates.
Model type:
- Multi-output Gaussian process model.
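A minimal sketch of how such model inputs might be computed from a DNA string. The function name `gene_features` and the feature names are hypothetical. Only codon frequencies, gene length and GC content are shown; CAI is left out because it needs a reference set of highly expressed genes.

```python
from itertools import product

BASES = "ACGT"
CODONS = ["".join(p) for p in product(BASES, repeat=3)]  # all 64 codons

def gene_features(seq):
    """Return codon frequencies, gene length and GC content for a DNA string."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = {c: 0 for c in CODONS}
    for c in codons:
        if c in counts:                       # skip codons with ambiguous bases
            counts[c] += 1
    total = max(sum(counts.values()), 1)
    features = {f"freq_{c}": n / total for c, n in counts.items()}
    features["length"] = len(seq)
    features["gc_content"] = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    return features

# First 15 bases of the example gene sequence shown earlier in the slides.
feats = gene_features("ATGCTGCAGATGTGG")
print(feats["freq_ATG"], feats["length"], feats["gc_content"])
```

Each gene then becomes one row of a feature matrix, which is the input the Gaussian process model works with.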
2. Obtain a set of gene design rules
The model is used to determine the best gene design
rules x*.
3. Design a new gene consistent with the design rules
1. Simulate a set of genes that encode the same protein as the gene we
   want to design.
2. Extract the features from all the simulated genes.
3. Rank the simulated genes by their similarity to the
   ‘optimal’ design rules.
4. Pick the best one and test it in the lab.
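The four steps can be sketched as follows. Everything specific here is an assumption made for illustration: the short protein, the two toy features (GC content and the fraction of codons ending in G or C), the target vector standing in for the ‘optimal’ design rules, and Euclidean distance as the similarity measure.

```python
import random

# Excerpt of synonymous DNA codons from the standard genetic code.
SYNONYMS = {
    "Thr": ["ACA", "ACC", "ACG", "ACT"],
    "Gly": ["GGA", "GGC", "GGG", "GGT"],
    "Lys": ["AAA", "AAG"],
    "Phe": ["TTC", "TTT"],
}

def random_synonymous_gene(protein, rng):
    """Step 1: draw one codon uniformly at random for each amino acid."""
    return "".join(rng.choice(SYNONYMS[aa]) for aa in protein)

def features(seq):
    """Step 2: toy features (GC content, fraction of codons ending in G or C)."""
    gc = sum(b in "GC" for b in seq) / len(seq)
    gc3 = sum(seq[i] in "GC" for i in range(2, len(seq), 3)) / (len(seq) // 3)
    return (gc, gc3)

def distance(seq, target):
    """Euclidean distance between a gene's features and the target features."""
    return sum((f - t) ** 2 for f, t in zip(features(seq), target)) ** 0.5

def rank_candidates(protein, target, n_samples=1000, seed=0):
    """Step 3: simulate synonymous genes and sort them by distance to the target."""
    rng = random.Random(seed)
    genes = {random_synonymous_gene(protein, rng) for _ in range(n_samples)}
    return sorted(genes, key=lambda s: (distance(s, target), s))

protein = ["Thr", "Gly", "Lys", "Phe", "Thr"]  # hypothetical short protein
target = (0.6, 1.0)                            # hypothetical 'optimal' design rules
ranked = rank_candidates(protein, target)
best = ranked[0]                               # step 4: candidate to test in the lab
print(best, distance(best, target))
```

Sampling random synonymous sequences sidesteps the combinatorial search over all codon choices, at the cost of only finding the best design among the sampled ones.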
Experiments
- Optimization of gene designs in mammalian cells.
- Dataset from Schwanhausser et al. (2011) with rates for 3810 genes.
  Associated sequences were extracted from
  http://www.ensembl.org.
- Features: frequency of appearance of the 64 codons, length of
  the gene, GC-content, AT-content, GC-ratio and AT-ratio.
- Selection of 10 random difficult-to-express genes (average log
  ratio < 1.5).
- 1,000 random ‘synonymous sequences’ generated from each
  gene.
Prediction of the translation rates (in a test sample of
1500 genes)
correlation ≈ 0.6
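For reference, a correlation between predicted and observed rates can be computed as in the sketch below. It assumes the Pearson correlation coefficient, which the slides do not specify, and the numbers are made up for illustration only. They are not the data behind the reported value.

```python
import numpy as np

def pearson_correlation(predicted, observed):
    """Pearson correlation coefficient between two 1-D arrays."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    p = predicted - predicted.mean()
    o = observed - observed.mean()
    return float(np.sum(p * o) / np.sqrt(np.sum(p ** 2) * np.sum(o ** 2)))

# Made-up numbers for illustration only (not the data from the slides).
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 1.0, 3.5, 3.0, 5.5]
r = pearson_correlation(predicted, observed)
print(r)
```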
Results for 10 low-expressed genes
Conclusions and future work
- Bayesian optimization is a promising technique for designing
  synthetic genes. It reduces the number of experiments!
- Important aspect of the problem → having a good surrogate
  model for the cell behavior.
- More features → better gene design → development of
  scalable Bayesian optimization methods that work well in
  high dimensions.
- Alternative approach: focus on the optimization of the
  sequences themselves, a combinatorial problem.