Download PowerPoint - Isaac Newton Institute

On Mixture Models in High-Dimensional Testing for the Detection of Differential Gene Expression Geoff McLachlan Department of Mathematics & Institute for Molecular Bioscience University of Queensland ARC Centre of Excellence in Bioinformatics http://www.maths.uq.edu.au/~gjm Microarray Data represented as N x M Matrix Sample 1 Sample 2 Expression Signature Gene 1 Gene 2 Expression Profile Gene N Sample M M columns (samples) ~ 102 N rows (genes) ~ 104 The Challenge for Statistical Analysis of Microarray Data Microarrays present new problems for statistics because the data are very high dimensional with very little replication. The challenge is to extract useful information and discover knowledge from the data, such as gene functions, gene interactions, regulatory pathways, metabolic pathways etc. Detection of Differential Expression • Supervised Classification of Tissue Samples • Clustering of Gene Profiles Gene 1 Gene 2 . . . . Gene N Sample 2 . . . . Sample 1 Sample M Sample 2 . . . . Sample 1 Gene 1 Gene 2 . . . . Gene N Class 1 Class 2 Sample M Hedenfalk Breast Cancer Data • cDNA arrays, tumour samples from carriers of either the BRCA1 or BRCA2 mutation (hereditary breast cancers) • BRCA1/BRCA2 mutations associated with decreased ability for DNA repair, yet pathologically distinct • Dataset of M = 15 patients: 7 BRCA1 vs 8 BRCA2, with N = 3,226 genes. The problem is to find genes which are differentially expressed between the BRCA1 and BRCA2 patients. Hedenfalk et al. (2001) NEJM, 344, 539-547 An example of a gene from Hedenfalk et al (2001) breast cancer data Class 1 BRCA1: n1 = 7 tissues gene j Class 2 BRCA2: n2 = 8 tissues -0.587 -0.5 -0.0707 -0.265 -0.542 -0.522 0.265 -0.7 0.377 0.0318 -0.475 -0.627 -0.56 1.39 -0.4 x1  0.3173, s12  0.1002 x2  0.1203, s22  0.5066 s 2j  0.3190 tj  x1 j  x2 j sj 1 n1  n12  0.6739 Pj  2 F13 ( | t j |)  0.512 Supervised Classification (Two Classes) Sample 1 ....... Sample M ....... Gene 1 Gene N Class 1 (good prognosis) Class 2 (poor prognosis) Selection Bias Bias that occurs when a subset of the variables is selected (dimension reduction) in some “optimal” way, and then the predictive capability of this subset is assessed in the usual way; i.e. using an ordinary measure for a set of variables. Selection Bias Discriminant Analysis: McLachlan (1992 & 2004, Wiley, Chapter 12) Regression: Breiman (1992, JASA) “This usage (i.e. use of residual of SS’s etc.) has long been a quiet scandal in the statistical community.” Selection bias with SVM using RFE: GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning) “The success of the RFE indicates that RFE has a built in regularization mechanism that we do not understand yet that prevents overfitting the training data in its selection of gene subsets.” Ambroise, C. and McLachlan, G.J (2008). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences USA 99, 6562-6566. Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples Nature Reviews Cancer, Feb. 2005 Number of Genes Error Rate for Top 70 Genes (without correction for Selection Bias as Top 70) Error Rate for Top 70 Genes (with correction for Selection Bias as Top 70) Error Rate for 5422 Genes 1 0.40 0.40 0.40 2 0.31 0.37 0.40 4 0.24 0.40 0.42 8 0.23 0.32 0.29 16 0.22 0.33 0.38 32 0.19 0.35 0.38 64 0.23 0.32 0.33 70 0.23 0.32 - 128 - - 0.32 256 - - 0.29 512 - - 0.31 1024 - - 0.32 2048 - - 0.35 4096 - - 0.37 5422 - - 0.37 Table 2: Ten-fold Cross-validated error rates for SVM based on subset with minimum error rate in RFE process with and without bias correction for optimization over subsets of varying sizes Dataset Error rate without bias correction for varying size of optimal subset Error rate with bias correction for varying size of optimal subset van ’t Veer (5422 genes) 0.29 0.37 Analysis of Breast Cancer Data • van’t Veer et al. (2002, Nature) • van de Vijver et al. (2002, NEJM) FDA has approved on 7/2/07 a new genetic test called MammaPrint®, "that determines the likelihood of breast cancer returning within five to 10 years after a woman's initial cancer." According to the FDA, the test, a product of Agendia (Amsterdam, the Netherlands), is the first cleared product that profiles genetic activity. MINDACT (Microarray for Node Negative Disease may Avoid Chemotherapy) prospective project, run by the EORTC (European Organisation for Research and Treatment of Cancer) and TRANSBIG, the translational research network of the Breast International Group • Zhu, J.X., McLachlan, G.J., Ben-Tovim, L., and Wood, I. (2008). On selection biases with prediction rules formed from gene expression data. Journal of Statistical Planning and Inference 38, 374-386. • McLachlan, G.J., Chevelu, J., and Zhu, J. (2008). Correcting for selection bias via cross-validation in the classification of microarray data. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honour of Professor Pranab K. Sen, N. Balakrishnan, E. Pena, and M.J. Silvapulle (Eds.). Hayward, California: IMS Collections, pp. 383395. Detection of Differential Expression • Supervised Classification of Tissue Samples • Clustering of Gene Profiles Microarray Data represented as N x M Matrix Sample 1 Sample 2 Expression Signature Gene 1 Gene 2 Expression Profile Gene N Sample M M columns (samples) ~ 102 N rows (genes) ~ 104 Clustering of Microarray Data Clustering of tissues on basis of genes: latter is a nonstandard problem in cluster analysis (n =M << p=N) EMMIX-GENE Clustering of genes on basis of tissues: genes (observations) not independent and structure on the tissues (variables) (n=N >> p=M) EMMIX-WIRE Clustering of gene expression profiles • Longitudinal (with or without replication, for example time-course) • Cross-sectional data EMMIX-WIRE EM-based MIXture analysis With Random Effects Ng, McLachlan, Wang, Ben-Tovim Jones, and Ng (2006, Bioinformatics) Supplementary information : http://www.maths.uq.edu.au/~gjm/bioinf0602_supp.pdf In the ith component of the mixture, the profile vector yj for the jth gene follows the model y j  Xβi  Ub ij  Vc i  ε ij p 1 m 1 qb 1 qc 1 p 1 b ij ~ N (0, H qb ) ci ~ N (0, ci I qc )  ij ~ N (0, A i ) A i  diag( Wφi ), φi  ( i21 ,,  iq2 e )T ( j  1,  , n ) Example: Hedenfalk Data n =15 tissues (7 from Class 1; 8 from Class 2) 3226 Genes y j  Xβ i  Ub ij  Vc i  ε ij where   Xβ i  X i1 ,  i 2  1 0       1 0  X , 0 1        0 1    bi1 j  , Ubij  X  bi 2 j   ci1    Vc i  I15    c   i ,15    i21  cov( εij )  diag( Wφi )  diag X 2   i 2  • Flack, L.K. and McLachlan, G.J. (2008). Clustering methods for gene-expression data. In Handbook of Research on Systems Biology Applications in Medicine, A. Daskalaki (Ed.). Hershey, Pennsylvania: Information Science Reference. To appear. • McLachlan, G.J., Bean, R., and Ng, S.K. (2008). Clustering of microarray data via mixture models. In Statistical Advances in Biomedical Sciences, A. Biswas, S. Datta, J. Fine, and M.R. Segal (Eds.). Hoboken, New Jersey: Wiley, pp. 365-384. • McLachlan, G.J., Bean, R., and Ng, S.K. (2008). Clustering. In Bioinformatics, Vol. 2: Structure, Function, and Applications, J. Keith (Ed.). Totowa, New Jersey: Humana Press, pp. 423439. Gene 1 Gene 2 . . . . Gene N Sample 2 . . . . Sample 1 Sample M Sample 2 . . . . Sample 1 Gene 1 Gene 2 . . . . Gene N Class 1 Class 2 Sample M An example of a gene from Hedenfalk et al (2001) breast cancer data Class 1 BRCA1: n1 = 7 tissues gene j Class 2 BRCA2: n2 = 8 tissues -0.587 -0.5 -0.0707 -0.265 -0.542 -0.522 0.265 -0.7 0.377 0.0318 -0.475 -0.627 -0.56 1.39 -0.4 x1  0.3173, s12  0.1002 x2  0.1203, s22  0.5066 s 2j  0.3190 tj  x1 j  x2 j sj 1 n1  n12  0.6739 Pj  2 F13 ( | t j |)  0.512 Using just the B permutations of the class labels for the gene-specific statistic tj , the P-value is assessed as: Pj  #{b :| t (b ) j || t j |} B where t(b)j is the null version of tj after the bth permutation of the class labels. If we pool over all N genes, then: B Pj   b 1 #{i :| t (b ) i || t j |, i  1,..., N } NB Multiplicity Problem When many hypotheses are tested, the probability of a false positive increases sharply with the number of hypotheses. Methods for dealing with the Multiplicity Problem • The Bonferroni Method controls the family wise error rate (FWER) i.e. the probability that at least one false positive error will be made  Too strict for gene expression data, tries to make it unlikely that even one false rejection of the null is made, may lead to missed findings • The False Discovery Rate (FDR) emphasizes the proportion of false positives among the identified differentially expressed genes.  Good for gene expression data – says something about the chosen genes The FDR and other error rates PREDICTED Retain Null Reject Null Null a Non-null c (false negative) d TRUE b (false positive) N r FDR ~ b N r FNDR ~ c ac The FDR and other error rates PREDICTED Retain Null Reject Null Null a Non-null c (false negative) d TRUE b (false positive) Nr FDR ~ b N r FNR = c cd b FDR  E{ } Nr  1 where Nr  1  max( Nr ,1) c FNDR  E{ } ( N  Nr )  1 Pj  2 F13 ( | t j |) z j  1[ F13 (t j )] Pj  2 F13 ( | t j |) z j   1 (1  Pj ) Local FDR Lee (2000), Efron et al (2001), Newton et al. (2001) proposed a two-component mixture model f ( z j )   0 f 0 ( z j )  (1   0 ) f1 ( z j )  0 ( z j )  pr{ jth gene is null | z j }   0 f0 ( z j ) f (z j )  0 f0 ( z j )  (by Bayes' theorem)  0 f 0 ( z j )  (1   0 ) f1 ( z j ) Global and Local FDR • The global FDR is concerned with the average number of false positives among the selected genes. • The local FDR gives a (probabilistic) assessment of differential expression for each gene. An approach using mixture models gives estimates for the local FDR as well as the global FDR and other error rates such as FNR. It also allows consideration whether an empirical null distribution should be adopted in place of the theoretical null. • McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX. (2005).Using mixture models to detect differentially expressed genes. Australian Journal of Experimental Agriculture 45, 859-866. • McLachlan GJ, Bean RW, Ben-Tovim Jones L. (2006). A simple implentation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 26, 1608-1615. • Efron B et al (2001) Empirical Bayes analysis of a microarray experiment. JASA 96,1151-1160. • Efron B (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. JASA 99, 96104. • Efron B (2004) Selection and estimation for large-Scale simultaneous inference. • Efron B (2005) Local false discovery rates. • Efron B (2006) Correlation and large-scale simultaneous significance testing. • Efron B (2006) Size, power and false discovery rates. • Efron B (2007) Simultaneous inference: when should hypothesis testing problems be combined. http://www-stat.stanford.edu/~brad/papers/ An example where local FDR is more informative: Glonek and Solomon (2003) F0: N(0,1), π0=0.6 F1: N(1,1), π1=0.4 Reject H0 if z≥2 Reject H0 if z≥2 τ0(2) = 0.99972 but FDR=0.17 τ0(2) = 0.251 but FDR=0.177 0.3 0.3 0.4 0.4 F0: N(0,1), π0=0.9 F1: N(1,1), π1=0.1 null 0.2 0.2 null 0.1 0.1 non-null 0.0 0.0 non-null -4 -2 0 2 4 -4 -2 0 2 4 The Procedure 1. Obtain the z-score for each of the genes zj   1 1  P  j 2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the Pvalues, Pj). 3. The posterior probability of non-differential expression of gene j, is given by  0(zj). 4. Conclude gene j to be differentially expressed if ˆ 0(zj) < c0 If we conclude that gene j is differentially expressed if: ˆ 0(zj)  co, then this decision minimizes the (estimated) Bayes risk Risk  (1  co ) 0e01  co1e10 where e01 is the probability of a false positive and e10 is the probability of a false negative. Suppose 0(z) is monotonic decreasing in z. Then ˆ0 ( z j )  c0 for z j  z0 1  F 0( z 0) ˆ FDR  ˆ 0 1  Fˆ ( z 0) E 0( z) | z  z 0 Suppose 0(z) is monotonic decreasing in z. Then ˆ0 ( z j )  c0 for z j  z0 1  F 0( z 0) ˆ FDR  ˆ 0 1  Fˆ ( z 0) where F 0( z 0)   ( z 0)   ˆ z   0 1  Fˆ ( z 0)  ˆ 0( z 0)  ˆ1  ˆ1  For a desired control level a, say a = 0.05, define  z0  arg min FDˆ R( z)  a z If 1  F 0( z ) ˆ 0 1  Fˆ ( z )  (1) is monotonic in z, then using (1) to control the FDR [with ˆ 0  1 and Fˆ ( z ) taken to be the empirical distribution function] is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistic zj. LOCAL FDR (POSTERIOR) PROB. Of GENE j BEING NOT DIFFERENTIALLY EXPRESSED  0 f0 ( z j ) 0(z j )   0 f 0 ( z j )  (1   0 ) f1 ( z j ) N(0,1)  0 f00 (z ( zjj))  0 (z j )   0 f 0 ( z j )  (1   0 ) f1 ( z j ) In order to proceed with estimation of π0 (can easily estimate f(zj) from z1,…,zN) we need to make the problem identifiable. Now f0(zj) is N(0,1) and we have to assume something about f1(zj). N(0,1)  0 f00 (z ( zjj))  0 (z j )   0 f00(z ( zjj))  (1   0 ) ff1 ((zz j)) 1 N(0,1) j N(μ1,σ12) Z-values, null case Z-values, +2 Z-values, +1 Z-values, +3 EMMIX-FDR A program has been written in R which interfaces with EMMIX to implement the algorithm described in McLachlan et al. (2006). We fit a mixture of two normal components to the P-values of the test statistics calculated from the gene expression data. Fit π0N(0,1) + (1- π0)N(μ1,σ12) via maximum likelihood. For given π0, MLEs of μ1, σ12 are determined: try various π0. When we equate the sample mean and variance of the mixture to their population counterparts, we obtain: z  ˆ 0 ˆ 0  ˆ1ˆ1 s  ˆ 0ˆ  ˆ1ˆ  ˆ 0ˆ1 ( ˆ 0  ˆ1 ) 2 z 2 0 2 1 2 When we are working with the theoretical null, we can easily estimate the mean and variance of the non-null component with the following formulae. ˆ12  {s z2  ˆ 0  ˆ 0 (1  ˆ 0 ) ˆ12 } /(1  ˆ 0 ) ˆ1  z /(1  ˆ 0 ) Estimated FDR N FDˆ R  ˆ0 ( z j ) I[ 0,c0 ] (ˆ0 ( z j )) /N r j 1 where N N r   I[ 0,c0 ] (ˆ0 ( z j )) j 1 Similarly, the false positive rate is given by N N j 1 j 1 FPˆ R  ˆ0 ( z j ) I[ 0,c0 ] (ˆ0 ( z j )) / ˆ0 ( z j ) and the false non-discovery rate and false negative rate by: N FNDˆ R  ˆ1 ( z j ) I ( c0 , ) (ˆ0 ( z j )) /( N  N r ) j 1 N N j 1 j 1 FNˆ R  ˆ1 ( z j ) I ( c0 , ) (ˆ0 ( z j )) / ˆ1 ( z j ) -4 -2 0 2 4 0 50 100 150 60 Null 20 40 Non–Null (DE genes) 0 Frequency 80 100 Fitting two component mixture model to Hedenfalk data Histogram of z-scores for 3226 Hedenfalk genes -4 -2 0 z 2 4 Ranking and Selecting the Genes Gene j Gene 1 . . . Gene 143 . . . . Gene 200 . . . Gene N Pj zj Local FDR ˆ ( z ) 0 j 0.0108 . . . 0.0998 0.1004 0.1010 . . 0.1252 FDR = Sum/143 = 0.06 co = 0.1 Proportion of False Negatives = 1 – Sum1/ 57 = 0.89 co Nr FDˆ R 0.1 143 0.06 0.32 0.88 0.004 0.2 338 0.11 0.28 0.73 0.02 0.3 539 0.16 0.25 0.60 0.04 0.4 742 0.21 0.22 0.48 0.08 0.5 971 0.27 0.18 0.37 0.12 FNˆ DR FNˆ R FPˆ R Estimated FDR and other error rates for various levels of threshold co applied to the posterior probability of nondifferential expression for the breast cancer data (Nr=number of selected genes) Comparison of identified DE genes Our method (143) Hedenfalk (175) 24 6 12 29 101 39 8 Storey and Tibshirani (160) Storey and Tibshirani (2003) PNAS, 100, 9440-9445 Uniquely Identified Genes: Differentially Expressed between BRCA1 and BRCA2 Gene UBE2B, DDB2 (UBE2V1) RAB9, RHOC ITGB5, ITGA3 PRKCBP1 HDAC3, MIF KIF5B, spindle body pole protein CTCL1 TNAFIP1 HARS, HSD17B7 GO Term DNA repair (cell cycle) small GTPase signal transduction integrin mediated signalling pathway regulation of transcription negative regulation of apoptosis cytoskeleton organisation vesicle mediated transport cation transport metabolism Estimates of π0 for Hedenfalk data • 0.52 (Broet, 2004) • 0.64 (Gottardo, 2006) • 0.61 (Ploner et al, 2006) • 0.47 (Storey, 2002) Using a theoretical null, we estimated π0 to be 0.65. Theoretical and empirical nulls Efron (2004) suggested the use of two kinds of null component: the theoretical and the empirical null. In the theoretical case the null component has mean 0 and variance 1 and the empirical null has unrestricted mean and variance. From Efron (2006) Theoretical null may not hold for 4 reasons 1. Failed assumptions • Maybe non-normality distorts student’s t-distribution • Can use permutation methods 2. Correlation across arrays • Student-t null density assumes independence across arrays • Permutation methods cannot help 3. Unobserved covariates (age, weight, stage) • Tend to widen null density of the zj’s • Permutation methods cannot help 4. Correlation across genes ˆ0 ( z j )   0 f 0 ( z j ) / fˆ ( z ) Estimation of f(z) does not require independence of zj’s Suppose (1), (2), or (3) is applicable but (4) is not (assume genes independent). null Zj may not be ~ N(0,1) i.e. theoretical null may not hold Thus: use empirical null  0 f00 (z ( z jj)  0 (z j )   0 f00 (z ( z jj))  (1   0 ) f11((z z j j)) N(μ0,σ02) N(μ1,σ12) μ0, σ02 are now estimated from the data. Call N(μ0, σ02) the empirical null distribution. Problem now is to fit  0 N ( 0 ,  )  (1   0 ) N ( 1 ,  ) 2 0 2 1 1. Specify an initial value of π0 (try theoretical null estimate and other estimates as before) 2. Rank zj’s and put Nπ0 smallest in null component and remainder in non-null component 3. Work out means/variances as if they are the true groups Can check for need of empirical null in place of theoretical null by comparing twice the increase in the log likelihood when fitting μ0, σ02 instead of fixing μ0=0 and σ02=1. From Efron (2006) Now suppose the zj’s are correlated (4th reason). Even if theoretical null N(0,1) is correct for an individual zj of a null gene, the zj’s for the null genes may not behave as N(0,1) variates in the ensemble of z1,…,zN. If they don’t, then the Benjamini-Hochberg procedure will break down using P-values based on theoretical null. Fit  0 N ( 0 ,  )  (1   0 ) N ( 1 ,  ) 2 0 2 1 Still using maximum likelihood, although the function we are maximizing is no longer the true likelihood due to correlation across the genes. 100 80 60 0 20 40 Frequency -4 -2 0 2 4 z-scores Breast cancer data: plot of fitted two-component normal mixture model with theoretical N(0,1) null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores. 100 80 60 0 20 40 Frequency -4 -2 0 2 4 z-scores Hedenfalk breast cancer data: plot of fitted two-component normal mixture model with empirical null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores. co Nr FDˆ R 0.1 143 0.06 0.32 0.88 0.004 0.2 338 0.11 0.28 0.73 0.02 0.3 539 0.16 0.25 0.60 0.04 0.4 742 0.21 0.22 0.48 0.08 0.5 971 0.27 0.18 0.37 0.12 Table 1. Theoretical Null FNˆ DR FNˆ R FPˆ R co Nr FDˆ R 0.1 62 0.07 0.23 0.93 0.00 0.2 212 0.13 0.20 0.77 0.01 0.3 343 0.17 0.18 0.64 0.02 0.4 504 0.23 0.15 0.51 0.05 0.5 644 0.28 0.13 0.41 0.07 Table 2. Empirical Null FNˆ DR FNˆ R FPˆ R co Nr FDˆ R 0.1 143 0.06 0.32 0.88 0.004 62 0.07 0.23 0.93 0.00 338 0.11 0.28 0.73 0.02 212 0.13 0.20 0.77 0.01 0.2 FNˆ DR FNˆ R Table 3. Theoretical versus Empirical Null FPˆ R Allison Mice Simulation Allison et al. (2002) generated data for 10 mice over 3000 genes. The data are generated in six groups of 500 with a value ρ of 0, 0.4, or 0.8 in the off-diagonal elements of the 500 x 500 covariance matrix used to generate each group. For a random 20% of the genes, a value d of 0, 4, or 8 is added to the gene expression levels of the last five mice. Ben-Tovim Jones, L., Bean, R.W., McLachlan, G.J., and Zhu, J.X. (2006). Mixture models for detecting differentially expressed genes in microarrays. International Journal of Neural Systems 16, 353-362. Theoretical null, ρ=0.8, d=4 Empirical null, ρ=0.8, d=4 Theoretical null, ρ=0.8, d=8 Empirical null, ρ=0.8, d=8 Clustering of gene expression profiles • Longitudinal (with or without replication, for example time-course) • Cross-sectional data EMMIX-WIRE EM-based MIXture analysis With Random Effects Ng, McLachlan, Wang, Ben-Tovim Jones, and Ng (2006, Bioinformatics) Supplementary information : http://www.maths.uq.edu.au/~gjm/bioinf0602_supp.pdf Example: Hedenfalk Data n =15 tissues (7 from Class 1; 8 from Class 2) 3226 Genes y j  Xβ i  Ub ij  Vc i  ε ij where   Xβ i  X i1 ,  i 2  1 0       1 0  X , 0 1        0 1    bi1 j  , Ubij  X  bi 2 j   ci1    Vc i  I15    c   i ,15    i21  cov( εij )  diag( Wφi )  diag X 2   i 2   i ( y j , c;  )  pr{Z ij  1 | y j , c}   i f ( y j | zij  1, ci ; i )  N(i,Bi, with  f ( y | z  1 , c ;  ) h j hj h h h 1 g i  Xi  Vci T Bi  Ai   biUU co Nr FDˆ R 0.1 62 0.07 0.23 0.93 0.00 257 0.06 0.32 0.79 0.01 212 0.13 0.20 0.77 0.01 480 0.10 0.27 0.63 0.02 0.2 FNˆ DR FNˆ R Table 4. Empirical versus Clustering Approach FPˆ R Summary • Mixture model based approach to finding DE genes is both convenient and effective • Gives measure of local as well as global FDR; also gives other error rates • Provides an empirical null for use when theoretical null may be misleading

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download PowerPoint - Isaac Newton Institute