Supporting Materials (Text) for "The Hindcast Skill of the CMIP Ensembles for the Surface Air Temperature Trend"

Koichi Sakaguchi, Xubin Zeng, and Michael A. Brunke
Department of Atmospheric Sciences, University of Arizona, Physics-Atmospheric Sciences Bldg., 1118 E. 4th St., Tucson, AZ 85721, USA

July 10, 2012, for Journal of Geophysical Research - Atmospheres

Table S1. Modeling centers and references for the models used in this study.

CMIP3
  Model             Center(a)                                    Reference
  CM2.0             NOAA/GFDL                                    Delworth et al., 2006
  E-R               NASA/GISS                                    Schmidt et al., 2006
  ECHO-G            MIUB/KMA                                     Roeckner et al., 1996; Wolff et al., 1997
  MIROC3.2 medres   CCSR (University of Tokyo)/NIES/JAMSTEC      K-1 Model Developers, 2004
  HadGEM1           MOHC                                         Johns et al., 2006; Martin et al., 2006
  CGCM2.3.2         MRI                                          Yukimoto et al., 2006a; 2006b
  CCSM3             NCAR                                         Collins et al., 2006

CMIP5
  Model             Center(a)                                    Reference
  CM3               NOAA/GFDL                                    Donner et al., 2011; Griffies et al., 2011
  E2-R              NASA/GISS                                    http://data.giss.nasa.gov/modelE/ar5/
  ESM-LR            MPI                                          Roeckner et al., 2003; Bathiany et al., 2010
  MIROC-ESM         AORI (University of Tokyo)/NIES/JAMSTEC      Watanabe et al., 2011
  HadGEM2-ES        MOHC                                         Collins et al., 2011; Martin et al., 2011
  CGCM3             MRI                                          Yukimoto et al., 2011
  CCSM4             NCAR                                         Gent et al., 2011

(a) NOAA: National Oceanic and Atmospheric Administration; GFDL: Geophysical Fluid Dynamics Laboratory; NASA: National Aeronautics and Space Administration; GISS: Goddard Institute for Space Studies; MIUB: Meteorological Institute of the University of Bonn; KMA: Korea Meteorological Administration; CCSR: Center for Climate System Research; NIES: National Institute for Environmental Studies; JAMSTEC: Japan Agency for Marine-Earth Science and Technology; MOHC: Met Office Hadley Centre; MRI: Meteorological Research Institute; NCAR: National Center for Atmospheric Research; MPI: Max Planck Institute for Meteorology; AORI: Atmosphere and Ocean Research Institute.

Text S2. Uncertainty Analysis

S2.1.
Uncertainty in Performance Statistics

The performance statistics (e.g., RMSE, correlation) are calculated from the running trend time series at each grid point and each spatiotemporal scale. Their sampling uncertainty is assessed using a variance corrected for serial correlation, to reflect the inter-dependence of the overlapping moving windows; for example, the lag-1 autocorrelation ranges from 0.78 (10-year running trend) to 0.98 (50-year running trend). For correlation, we followed the effective sample size of order two as reviewed and derived in Bretherton et al. [1999] [their equation (30)]. Based on this estimated effective sample size, Fisher's Z transformation is used to construct the confidence interval and the statistical significance test for the correlation. For RMSE, the same equation for the effective sample size is used to estimate the variance of the sample mean of the squared errors as a function of the autocorrelation of the squared-error time series. The variance of the sampling distribution of the Brier score (BS) was derived by Bradley et al. [2008] (their equation (19)) and further modified by Wilks [2010] to incorporate the effect of serial correlation:

    n'/n = [1 − (1 − μx){b(1 − BS)r1}²] / [1 + (1 − μx){b(1 − BS)r1}²]    (S1)

where n is the number of samples, n' is the effective sample size, μx is the climatological probability of the event occurrence (in our case the mean probability of a positive trend over all available time windows for a given grid point), BS is the Brier score, r1 is the lag-1 autocorrelation of the predicted probability time series, and b is a parameter set to 0.8 in this study, although ideally it would be varied depending on the spatiotemporal scale and the associated ensemble characteristics. The sensitivity of the variance to the parameter b is small: a change of 0.1 in b results in a change of only 0.001 in the square root of the sample variance (i.e., the standard error of BS).
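The effective-sample-size correction and the Fisher-Z confidence interval described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: it uses the common first-order formula n' = n(1 − r1)/(1 + r1) as a stand-in for the second-order expression of Bretherton et al. [1999], and the function names and example numbers are purely illustrative.

```python
import math

def effective_sample_size(n, r1):
    # First-order AR(1) correction for serially correlated samples:
    # n' = n * (1 - r1) / (1 + r1). Bretherton et al. [1999] derive a
    # second-order version (their equation (30)); this simpler form is
    # used here only as an illustrative stand-in.
    return n * (1.0 - r1) / (1.0 + r1)

def fisher_z_ci(r, n_eff, z_crit=1.96):
    # Two-sided confidence interval for a correlation coefficient via
    # Fisher's Z transformation, with the serial-correlation-corrected
    # (effective) sample size used in place of the raw sample size.
    z = math.atanh(r)                    # Fisher's Z transform of r
    se = 1.0 / math.sqrt(n_eff - 3.0)    # standard error of Z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Example: 50 overlapping windows of 10-year running trends with a
# lag-1 autocorrelation of 0.78 (the value quoted above) leave only
# ~6 effective samples, so the interval around r = 0.7 is very wide.
n_eff = effective_sample_size(50, 0.78)
lo, hi = fisher_z_ci(0.7, n_eff)
```

With so few effective samples the 95% interval around r = 0.7 stretches from slightly negative values to nearly 1, which illustrates why the serial-correlation correction matters for the significance tests described above.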
For the rank histogram, the Chi-square test for uniformity (i.e., for a reliable prediction) is found to be sensitive to the noise added to reflect observational uncertainty [Anderson, 1996; Candille and Talagrand, 2008], particularly at the smaller scales in our analysis. We therefore took a Monte Carlo approach, constructing 1000 realizations of the rank histogram by adding different random noise to the ensemble members. The random noise for a given grid point is based on the difference between the running trend time series from HadCRUT3 and NCDC, assuming a normal distribution with a mean of zero (since the mean difference between the two observational data sets is substantially smaller than the differences between the model simulations and HadCRUT3) and with the variance of the difference of the running trend time series. This "perturbed ensemble" approach is also used to estimate the effect of observational errors on RMSE, r, and BS, as described in the main text (Section 2). The critical values of the standard Chi-square test assume independent samples; to take the effect of serial correlation into consideration, Wilks [2004] introduced additive corrections to the critical values for a given significance level and lag-1 autocorrelation (his Table 2). Since the additive corrections are given only for particular lag-1 autocorrelations (0.4, 0.5, ..., 0.9), we interpolate them to other autocorrelation values by cubic spline interpolation (as provided by MATLAB). We did not interpolate the additive corrections at each grid point; instead we used typical lag-1 autocorrelations for the five temporal scales in our analysis: 0.78, 0.92, 0.96, 0.97, and 0.98. This simplified correction appears reasonable, since most of the time the Chi-square test values from the Monte Carlo distribution do not exceed the critical values even without the corrections.

S2.2. Field Significance

It is desirable to test whether the number of grid points (or the area) at which a performance statistic beats a given null hypothesis (such as zero correlation) at a given significance level exceeds the number of locally significant grid points obtainable by chance, using a binomial distribution, and also whether the count of significant grid points is affected by the spatial correlation of the surface air temperature anomaly [Livezey and Chen, 1983]. For the 15°x15° and smaller spatial scales, the spatial correlation of the surface air temperature anomaly is not negligible [Hansen et al., 1999], and we take it into consideration with a Monte Carlo approach. We randomly resampled the HadCRUT3 data 1000 times following the moving-blocks bootstrap, which preserves the spatial and temporal correlations in the data (Wilks [1997]; although the lag-1 autocorrelations of the resampled data are usually lower than those of the original observations by ~0.1 in our case). This gives the distribution of the number of locally significant grid points that can be obtained by chance under similar spatial and temporal correlations. For the 5°x5°, 50-year spatiotemporal scale, we were not able to produce bootstrap samples because of the higher fraction of missing data and the strong autocorrelation of the trend time series. The null hypotheses for each statistic at the grid level are:

1) RMSE is equal to or greater than the variability of the observed running trend.
2) Correlation is equal to or smaller than zero.
3) The Brier score is equal to or smaller than that obtained with a climatological probability of positive trend. This is explored by testing whether the Brier skill score (BSS) is equal to or smaller than zero,

    BSS = (BSclim − BS) / BSclim = 1 − BS / BSclim    (S2)

where BSclim is the BS obtained by using a constant climatological probability of the event [Wilks, 2010]. The sample variance for BSS was derived by Bradley et al.
[2008] separately from that for BS, and a correction based on the autocorrelation of the probabilistic forecasts was suggested by Wilks [2010]. In general, the sample variance of BSS is less affected by the autocorrelation than that of BS. To summarize the results: field significance (at the α = 0.05 level for both local and field significance) was not obtained for the first null hypothesis, concerning RMSE, at any of the spatial scales explored here (15°x15° and smaller) for any of the temporal scales. For the correlation between the ensemble means and the HadCRUT3 trends, the CMIP5 ensemble obtained field significance for the 10-year trend at all three spatial scales, while the CMIP3 ensemble did not obtain field significance at any scale (again considering all temporal scales and the spatial scales of 15°x15° and smaller). For BS being better than climatology, both ensembles obtained field significance at all scales, except for the 10-year scale at all three spatial scales in the CMIP3 ensemble.

In the main text (Section 3.1) we used a simpler approach to summarize the model performance at each scale, comparing the 75th percentiles across grid points for RMSE and BS (i.e., 75% of the available grid points have smaller RMSE or BS) and the 25th percentiles for correlation (i.e., 75% of the available grid points have higher correlation) to reference values: unity for RMSE after normalization by the standard deviation of the observed running trend, 0.7 for correlation (corresponding to r² ~ 0.5), and 0.25 for BS (corresponding to the score of a random 50-50 guess). This approach is simpler and easily reproducible. Although it appears to discard the sampling uncertainty at the grid-point level, under- or overestimates of the performance statistics can partially cancel out, since a large number of grid points (75% of the available ones) is involved in summarizing the performance.

S2.3. Uncertainties Associated with Model Selection

We performed sensitivity tests to infer how sensitive the main results of this study are to the selection of the models. First, we formed 21 different five-model ensembles from seven CMIP5 models (21 being the maximum possible number of five-model combinations out of seven; two of the seven models differ from those used for the results shown in this study). Among the 21 ensemble means, the RMSE values are visually indistinguishable, the spread in correlation is less than 0.1, and the spread in BS is less than 0.03, except for the 50-year trend at large spatial scales, where the spread can be as high as 0.1. The deterministic performance of the ensemble mean therefore does not appear to be very sensitive to the model selection, but for probabilistic predictions at larger spatiotemporal scales (with higher sampling uncertainty) the model selection may have some impact. The sensitivities of the probabilistic predictions are further explored with five additional ensembles, and the results are included in Section 4 of the main text.

References

Anderson, J. L. (1996), A method for producing and evaluating probabilistic forecasts from ensemble model integrations, J. Clim., 9, 1518-1540.

Bathiany, S., M. Claussen, V. Brovkin, T. Raddatz, and V. Gayler (2010), Combined biogeophysical and biogeochemical effects of large-scale forest cover changes in the MPI earth system model, Biogeosciences, 7, 1383-1399, doi:10.5194/bg-7-1383-2010.

Bradley, A. A., S. S. Schwartz, and T. Hoshino (2008), Sampling uncertainty and confidence intervals for the Brier score and Brier skill score, Wea. Forecasting, 23(5), 992-1006.

Bretherton, C. S., M. Widmann, V. P. Dymnikov, J. M. Wallace, and I. Bladé (1999), The effective number of spatial degrees of freedom of a time-varying field, J. Clim., 12, 1990-2009.

Candille, G., and O. Talagrand
(2008), Impact of observational error on the validation of ensemble prediction systems, Q. J. R. Meteorol. Soc., 134, 959-971.

Collins, W. D., et al. (2006), The Community Climate System Model version 3 (CCSM3), J. Clim., 19(11), 2122-2143.

Collins, W. J., et al. (2011), Development and evaluation of an Earth-System model - HadGEM2, Geosci. Model Dev., 4, 1051-1075, doi:10.5194/gmd-4-1051-2011.

Delworth, T. L., et al. (2006), GFDL's CM2 global coupled climate models. Part I: Formulation and simulation characteristics, J. Clim., 19(5), 643-674.

Donner, L. J., et al. (2011), The dynamical core, physical parameterizations, and basic simulation characteristics of the atmospheric component AM3 of the GFDL global coupled model CM3, J. Clim., 24(13), 3484-3519.

Gent, P. R., et al. (2011), The Community Climate System Model version 4, J. Clim., 24, 4973-4991.

Griffies, S. M., et al. (2011), The GFDL CM3 coupled climate model: Characteristics of the ocean and sea ice simulations, J. Clim., 24(13), 3520-3544.

Hansen, J., R. Ruedy, J. Glascoe, and M. Sato (1999), GISS analysis of surface temperature change, J. Geophys. Res., 104(D24), 30,997-31,022.

Johns, T. C., et al. (2006), The new Hadley Centre climate model (HadGEM1): Evaluation of coupled simulations, J. Clim., 19, 1327-1353.

Jones, C. D., et al. (2011), The HadGEM2-ES implementation of CMIP5 centennial simulations, Geosci. Model Dev., 4, 543-570, doi:10.5194/gmd-4-543-2011.

K-1 Model Developers (2004), K-1 Coupled Model (MIROC) Description, edited by H. Hasumi and S. Emori, Center for Climate System Research, University of Tokyo, Tokyo, Japan.

Livezey, R. E., and W. Y. Chen (1983), Statistical field significance and its determination by Monte Carlo techniques, Mon. Wea. Rev., 111, 46-59.

Martin, G. M., M. A. Ringer, V. D. Pope, A. Jones, C. Dearden, and T. J. Hinton (2006), The physical properties of the atmosphere in the new Hadley Centre Global Environmental Model, HadGEM1.
Part I: Model description and global climatology, J. Clim., 19, 1274-1301.

Martin, G. M., et al. (2011), The HadGEM2 family of Met Office Unified Model climate configurations, Geosci. Model Dev., 4, 723-757, doi:10.5194/gmd-4-723-2011.

Roeckner, E., et al. (1996), The atmospheric general circulation model ECHAM4: Model description and simulation of present-day climate, MPI Report No. 218, Max-Planck-Institut für Meteorologie, Hamburg, Germany.

Roeckner, E., et al. (2003), The atmospheric general circulation model ECHAM5. Part I: Model description, MPI Report No. 349, Max-Planck-Institut für Meteorologie, Hamburg, Germany.

Schmidt, G. A., et al. (2006), Present-day atmospheric simulations using GISS ModelE: Comparison to in situ, satellite, and reanalysis data, J. Clim., 19(2), 153-192.

Watanabe, S., et al. (2011), MIROC-ESM 2010: model description and basic results of CMIP5-20c3m experiments, Geosci. Model Dev., 4, 845-872, doi:10.5194/gmd-4-845-2011.

Wilks, D. S. (1997), Resampling hypothesis tests for autocorrelated fields, J. Clim., 10(1), 65-82.

Wilks, D. S. (2004), The minimum spanning tree histogram as a verification tool for multidimensional ensemble forecasts, Mon. Wea. Rev., 132, 1329-1340.

Wilks, D. S. (2010), Sampling distributions of the Brier score and Brier skill score under serial dependence, Q. J. R. Meteorol. Soc., 136, 2109-2118.

Wolff, J.-O., E. Maier-Reimer, and S. Legutke (1997), The Hamburg Ocean Primitive Equation Model, DKRZ Technical Report No. 13, Deutsches Klimarechenzentrum, Hamburg, Germany.

Yukimoto, S., A. Noda, T. Uchiyama, S. Kusunoki, and A. Kitoh (2006a), Climate changes of the twentieth through twenty-first centuries simulated by the MRI-CGCM2.3, Pap. Meteor. Geophys., 56, 9-24.

Yukimoto, S., et al. (2006b), Present-day climate and climate sensitivity in the Meteorological Research Institute coupled GCM version 2.3 (MRI-CGCM2.3), J. Meteor. Soc. Japan, 84(2), 333-363.

Yukimoto, S., et al. (2011), Meteorological Research Institute-Earth System Model version 1 (MRI-ESM1), Technical Report of the Meteorological Research Institute No. 64, Meteorological Research Institute, Tsukuba, Ibaraki, Japan.

Figure Captions

Figure S1. Time series of the global mean running linear trend from the CMIP3 and CMIP5 models and two observational data sets. The top two panels show 10-year trends, and the bottom two panels show 50-year trends. The left and right columns show the CMIP3 and CMIP5 model simulations, respectively.

Figure S2. Spatial distribution of RMSE against the HadCRUT3 SAT trend. The x-axis shows the number of years for the linear trend, grouped for eight different spatial scales (labeled at the top of each panel with the same notation as Fig. 1). The edges of the boxes represent the 25th and 75th percentiles of the statistics from all grid points (black: NCDC, green: CMIP5-EM). The medians are shown by lines as they appear in the legend. The dashed lines in the global-average subpanels show 95% confidence intervals.