Reverse engineering gene networks using singular value decomposition and robust regression
M. K. Stephen Yeung, Jesper Tegner, James J. Collins

General idea
Reverse-engineer a gene network:
• on a genome-wide scale
• from a small amount of data
• with no prior knowledge
• using SVD to obtain a family of possible solutions
• using robust regression to choose among them

If the system is near a steady state, its dynamics can be approximated by a linear system of N ODEs:

  dx_i(t)/dt = −λ_i x_i(t) + Σ_{j=1..N} W_ij x_j(t) + b_i(t) + ξ_i(t)

where
  x_i = concentration of the ith mRNA (reflects the expression level of gene i)
  λ_i = self-degradation rate
  b_i = external stimulus
  ξ_i = noise
  W_ij = type and strength of the effect of the jth gene on the ith gene

Suppositions made:
• Connections are not time-dependent (so W is constant) and are not changed by the experiments
• The system is near a steady state
• Noise is discarded, so exact measurements are assumed
• Ẋ can be computed accurately enough

In M experiments with N genes, each time:
• apply stimuli (b_1,…,b_N) to the genes
• measure the concentrations of the N mRNAs (x_1,…,x_N) using a microarray

This yields the N×M data matrix

  X = (x_i^j),  i = 1…N, j = 1…M

where the subscript i is the mRNA number and the superscript j is the experiment number. The goal is to use as few measurements as possible; with this method (and exact measurements), M = O(log N).

The system becomes

  Ẋ_{N×M} = A_{N×N} X_{N×M} + B_{N×M},  with A = W + diag(−λ_i)

Compute Ẋ from several measurements of X (e.g. using interpolation). The goal is then to deduce W (or A) from the rest:

  X^T_{M×N} A^T_{N×N} = Ẋ^T_{M×N} − B^T_{M×N}

If M = N, one could compute (X^T)^{-1}, but usually M << N (that is the point: M = O(log N)). Therefore, use SVD to find a least-squares solution:

  X^T_{M×N} = U_{M×N} W_{N×N} V^T_{N×N}

Here U and V are orthogonal (U^T = U^{-1}) and W = diag(w_1,…,w_N), with w_i the singular values of X. Suppose the singular values are ordered so that the zero ones come first: w_i = 0 for i = 1…L and w_i ≠ 0 for i = L+1…N.

Then the least-squares (L2) solution to the problem is

  A_0 = (Ẋ_{N×M} − B_{N×M}) U_{M×N} diag(1/w_j) V^T_{N×N}

with 1/w_j replaced by 0 wherever w_j = 0. This formula tries to match every data point as closely as possible. The full family of solutions is

  A = A_0 + C V^T

with C = (c_ij)_{N×N}, where c_ij = 0 if j > L and is otherwise an arbitrary scalar coefficient.

How to choose from this family of solutions? The least-squares method tries to match every data point as closely as possible, which yields a not-so-sparse matrix with many small entries. Two options:
1. Impose prior biological knowledge on the solutions, e.g. when two genes are known to be related, the matrix must reflect this.
2. Assume that real gene networks are sparse, and look for the sparsest matrix: search for the c_ij that maximize the number of zero entries in A.

So:
• get as many zero entries as possible, and thereby a sparse matrix
• the non-zero entries form the connections
• fit as many measurements as possible exactly: "robust regression" (again assuming exact measurements)

This is done with L1 regression: considering A = A_0 + C V^T, look for the C for which ||A_0 + C V^T||_1 is minimal. This produces as many zeros as possible. The implementation uses the simplex method (a linear-programming method).

Thus, to reverse-engineer a network of N genes, we "only" need M_c = O(log N) experiments.
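The SVD-plus-L1 procedure above can be sketched in a few lines of numpy/scipy. This is a minimal illustration, not the authors' code: the function name `reverse_engineer`, the rank tolerance, and the use of scipy's LP solver (HiGHS) in place of the paper's simplex implementation are all our choices, and noise-free data is assumed.

```python
import numpy as np
from scipy.optimize import linprog

def reverse_engineer(X, Xdot, B):
    """Recover the connectivity matrix A from Xdot = A X + B with M < N.

    X, Xdot, B are N x M arrays (genes x experiments).
    Sketch of the SVD + L1 ("robust regression") idea: compute the
    least-squares solution A0, then search the solution family
    A = A0 + C V^T for the row-wise minimal L1 norm via linear programming.
    """
    N, M = X.shape
    Y = (Xdot - B).T                         # M x N right-hand sides
    # SVD of X^T; rows of Vt beyond the rank span the null space of X^T
    U, w, Vt = np.linalg.svd(X.T, full_matrices=True)
    rank = int(np.sum(w > 1e-10 * w.max()))
    Vnull = Vt[rank:].T                      # N x K null-space basis
    A0 = (np.linalg.pinv(X.T) @ Y).T         # least-squares (L2) solution
    K = Vnull.shape[1]
    if K == 0:                               # M >= N and full rank: unique solution
        return A0
    A = np.zeros((N, N))
    for i in range(N):                       # one row (gene) at a time
        a0 = A0[i]
        # minimize ||a0 + Vnull c||_1 over c, in standard LP form:
        # variables z = [c (K), t (N)]; min sum(t) s.t. -t <= a0 + Vnull c <= t
        obj = np.concatenate([np.zeros(K), np.ones(N)])
        A_ub = np.block([[Vnull, -np.eye(N)], [-Vnull, -np.eye(N)]])
        b_ub = np.concatenate([-a0, a0])
        bounds = [(None, None)] * K + [(0, None)] * N
        res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        A[i] = a0 + Vnull @ res.x[:K] if res.success else a0
    return A
```

Every member of the family fits the data equally well (the null-space correction changes nothing about X^T a_i), so the L1 criterion only decides which member is reported; the recovered matrix always satisfies Ẋ = AX + B on the measured experiments.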
Then M_c << N, and the computational cost is O(N^4). (Brute-force methods would cost O(N!/(k!(N−k)!)), with k non-zero entries per row.)

Test 1
• Create a random connectivity matrix: for each row, select k entries to be non-zero, with k < k_max << N (to impose sparseness) and each non-zero entry drawn from a uniform distribution
• Apply random perturbations
• Take measurements while the system relaxes back to its previous steady state → X
• Compute Ẋ by interpolation
• Repeat M times
• Then apply the algorithm to obtain an approximation à of A, and compute the error

  E = Σ_{i=1..N} Σ_{j=1..N} e_ij,  where e_ij = 1 if |Ã_ij − A_ij| exceeds a small tolerance, and 0 otherwise

• Results: M_c = O(log N), better than SVD alone without the regression step

Test 2
• One-dimensional cascade of genes
• Result for N = 400: M_c = 70

Test 3
• Large sparse gene network, with random connections, external stimuli, …
• Results the same as in the previous tests

Discussion
Advantages:
• Very little data needed, in comparison with neural networks or Bayesian models
• No prior knowledge needed
• Easy to parallelize, as it recovers the connectivity matrix row by row (gene by gene)
• Also applicable to protein networks

Disadvantages:
• Less efficient for small networks (M ≈ N)
• No quantification yet of the necessary "sparseness", though on average 10 connections works well for a network of more than 200 genes
• Ẋ is uncertain
• Mainly useful with exact data, which we do not have

Improvements
• Other algorithms to impose sparseness: alternatives exist both for L1 (the basic criterion) and for simplex (the implementation)
• By using a deterministic linear system of ODEs, a lot has been neglected (noise, time delays, nonlinearities)
• Connections could be changed by the experiments; then a time-dependent W would be necessary
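The Test 1 setup and the error count E can be sketched as follows. This is an illustrative reconstruction, assuming the slides' description; the function names, the uniform(−1, 1) range for the non-zero entries, and the tolerance value are our choices, not taken from the paper.

```python
import numpy as np

def random_sparse_connectivity(N, kmax, rng):
    """Test-1 style connectivity matrix: each row gets k non-zero entries,
    with k < kmax << N (imposed sparseness), drawn from a uniform distribution."""
    A = np.zeros((N, N))
    for i in range(N):
        k = int(rng.integers(1, kmax))            # 1 <= k < kmax
        cols = rng.choice(N, size=k, replace=False)
        A[i, cols] = rng.uniform(-1.0, 1.0, size=k)
    return A

def error_count(A_est, A_true, tol=1e-2):
    """E = sum of e_ij, with e_ij = 1 where |A~_ij - A_ij| exceeds tol, else 0."""
    return int(np.sum(np.abs(A_est - A_true) > tol))
```

With E = 0 the network is recovered exactly (up to the tolerance); plotting E against M for growing N is what yields the observed M_c = O(log N) scaling.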