SYSTEMS ANALYSIS LABORATORY

Construction of SARIMAX-models using MATLAB

Mat-2.4108 Independent research projects in applied mathematics
Antti Savelainen, 63220J
9/25/2009

Contents

1 Introduction
2 Existing MATLAB functions for ARMAX-models
3 MATLAB implementation for SARIMAX-models
4 Numerical example
5 Comparison of MATLAB and SAS software
   Results
   Usability
6 Conclusion
Bibliography

1 Introduction

The course Mat-2.3132 Systems Analysis Laboratory I [12] covers time series analysis and the implementation of seasonal autoregressive integrated moving average models with an external input, i.e. SARIMAX-models. The basis of a SARIMAX-model is an ARMA-model, which contains only autoregressive and moving average parts. In the course the models are used to forecast a company's electricity consumption. The first task is to identify an appropriate SARIMA-model [1] that fits the data; the external data is then added and the model becomes a SARIMAX-model.
The data consists of the company's electricity consumption and the outdoor temperature, recorded at one-hour intervals over a period of 4 weeks. The outdoor temperature can be used as an external variable (the X term) in the model if the two series are correlated. The consumption and the temperature are plotted with SAS software [15] in Figure 1. The data shows an evident linear trend and seasonal behavior with periods of at least 24 and 168 hours, so a SARIMAX-model is a plausible candidate.

The purpose of this research project is to construct a MATLAB [8] implementation of functions for building, identifying, fitting and checking models for time series, i.e. sequences of successive data points. This implementation makes it possible to use the Box-Jenkins methodology [1] to forecast the unknown values of a stochastic time series. The project produced MATLAB functions to difference a nonstationary time series, identify and build an appropriate SARIMAX-model, decide whether the model is adequate, and forecast with the finished model [1]. These tools are then applied in a numerical example to forecast the company's electricity consumption data given in the course Mat-2.3132 Systems Analysis Laboratory I.

At present, SAS software is used as the statistics tool for constructing a SARIMAX-model. SAS is able to compute seasonal SARMA-models, ARIMA-models with integrated data, ARMAX-models with an external variable, and all combinations of these. Within this course, however, it has to be run on a remote computer via an SSH connection, which is not desirable. After this research project students should be able to use MATLAB to estimate the parameters of a SARIMAX-model on their own workstations. Furthermore, this research project compares the capability of MATLAB and SAS to build, identify and check SARIMAX-models.
They are compared according to their numerical results, and their applicability as statistical programs is analyzed with regard to SARIMAX-models. At the end there is a short review of alternative programs that could be used on the course Mat-2.3132 Systems Analysis Laboratory I.

Figure 1 Original consumption (in green) and the temperature (in purple) data plotted with SAS

2 Existing MATLAB functions for ARMAX-models

MATLAB contains a System Identification Toolbox [10], which makes it possible to construct mathematical models of dynamic systems. The toolbox fits both linear and nonlinear models to data, whereas the Box-Jenkins methodology aims to fit a suitable linear model to a time series and then optimize the parameter values by maximizing the likelihood function [14]. The likelihood function depends on the sample values and the unknown parameters of the model, and the estimation algorithm finds the parameter values that would most likely have generated the sample.

The System Identification Toolbox contains two functions that made it possible to implement a statistics tool for constructing a SARIMAX-model. The function armax estimates the parameters of an ARMA- or ARMAX-model; this is the essential building block that was extended to estimate SARIMAX-models. After the parameters have been estimated, the ARMAX-model is used to forecast future values of the time series with the function predict.

3 MATLAB implementation for SARIMAX-models

Although MATLAB can estimate the parameters of ARMA- and ARMAX-models, it is not sufficient as such as a statistical tool on the course Mat-2.3132 Systems Analysis Laboratory I. MATLAB lacked ready-made functions especially for identifying, building and checking SARIMAX-models. Identification requires MATLAB to be able to produce autocorrelation and partial autocorrelation functions.
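As a rough illustration of the identification step, the sample autocorrelation function can be sketched in a few lines. The sketch is written in Python rather than MATLAB so that it is self-contained; the name sample_acf and the toy series are hypothetical and are not the functions written in this project.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares, used for scaling
    if c0 == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


# Hypothetical toy series with a period of 4 samples (not the course data).
series = [1.0, 2.0, 3.0, 2.0] * 6
acf = sample_acf(series, 8)
```

For this toy series the autocorrelation is strongly positive at lag 4 and negative at lag 2, which is the kind of seasonal pattern looked for at lags 24 and 168 in the consumption data.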
Both the autocorrelation and the partial autocorrelation function are important when deciding the orders of the parameters in an ARMA-model and the differencing order. Additionally, a cross correlation function is implemented to reveal the correlation between the electricity consumption and the temperature at different lags. Moreover, a differencing function is needed so that the data can be differenced not only at lag 1 but also at other lags, such as the lengths of the seasonal periods.

A spectral density function is implemented as a function specdens. The spectral density measures how the energy of a signal is distributed over frequencies, and it is used to characterize the properties of the signal. Mathematically, a signal's spectrum is the square of the absolute value of its Fourier transform [7].

As Figure 1 shows, the data is notably nonstationary and needs to be differenced, as the definition of ARMA-models requires. MATLAB's lack of differencing at lags other than one was solved by writing a simple differencing function differ.

Building seasonal models required the most work, because the armax function in MATLAB can only construct ARMAX- and ARMA-models. The seasonal ARMA (SARMA(p,q)x(P,Q)) function is built as an ARMA function in which some of the parameters are locked to zero. For instance, let the season length of the SAR-part be $S$ and that of the SMA-part be $Q$. First, a full-length ARMA(S,Q)-model

$$\phi(B)\,y_t = \theta(B)\,\varepsilon_t$$

is created, where $\varepsilon_t$ is white noise and $B$ is the lag operator. The polynomials $\phi(B)$ and $\theta(B)$ contain the model parameters $\phi_1,\dots,\phi_S$ and $\theta_1,\dots,\theta_Q$ to be estimated. This model is then modified to be a SARMA(p,q)x(P,Q) model by setting the parameters $[\phi_{p+1},\dots,\phi_{S-1}]$ and $[\theta_{q+1},\dots,\theta_{Q-1}]$ to zero. This procedure provides polynomials in which only the coefficients at lags $1,\dots,p$ and $S$ of the AR-part and at lags $1,\dots,q$ and $Q$ of the MA-part remain to be estimated. The approach extends to vector-valued P and Q.

Since the ARMA-model is estimated on differenced data, it becomes a SARIMA-model, and the estimated data needs to be integrated.
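Differencing at an arbitrary lag and its inverse, integration by cumulative summation, can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the project's differ function: the names difference and integrate are hypothetical, the toy series uses a period of 24 instead of 168 to stay short, and integration assumes that the first `lag` original values are known.

```python
def difference(x, lag=1):
    """Return x[t] - x[t - lag] for t = lag .. len(x) - 1."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


def integrate(d, head):
    """Invert difference(); head holds the first len(head) original values."""
    out = list(head)
    for t, v in enumerate(d):
        # out[t] is the value len(head) steps before the one being rebuilt,
        # so every rebuilt value depends on earlier rebuilt values.
        out.append(out[t] + v)
    return out


# Hypothetical toy data: linear trend plus a pattern that repeats every 24 steps.
x = [t % 24 + 2 * t for t in range(72)]
d24 = difference(x, 24)       # removes the seasonal pattern
d24_1 = difference(d24, 1)    # removes what is left of the trend
restored = integrate(integrate(d24_1, d24[:1]), x[:24])
```

With exact differences the original series is recovered without loss. The loop also shows why the result is sensitive to errors: each rebuilt value is computed from a previously rebuilt one.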
Two different ways of integrating the estimated data were attempted for a model with an external input, in which the parameter of the input is estimated along with the other parameters. The first idea is to estimate the SARMAX-model on the differenced data as usual and afterwards integrate the estimate by cumulatively summing its values. Intuitively this should work, but in practice small errors at the beginning of the ex post estimate accumulate, because each new value is summed onto earlier erroneous values again and again. Consequently, this summing method did not work well, even though the profile of the estimate is close to the original data, as shown in Figure 2.

Figure 2 Original and estimated ex post forecast integrated by summing the values of the estimate [plot: Consumption (kWh) against Time (h); series Estimated and Original]

The other way is not to integrate the estimate at all but to revise the AR-part of the ARIMA-model instead. Let the data in the ARMA-model be differenced at lag $d$, so that the model reads $\phi(B)(1 - B^{d})\,y_t = \theta(B)\,\varepsilon_t$. The product $\phi(B)(1 - B^{d})$ can then be treated as a single, longer AR-polynomial that acts directly on the undifferenced data. This is implemented by the function rearrange, which re-arranges the AR-part accordingly. As a result, the model estimates the integrated data shown in Figure 3, and this worked much better than summing up the values of the estimate. Note that the length of the AR-part increases when the differencing parameters are re-arranged into it.

Figure 3 Original and estimated ex post forecast integrated by re-arranging the AR-part of the model [plot: Consumption (kWh) against Time (h); series Estimated and Original]

The checking stage of Box-Jenkins modeling is mostly based on analyzing the residuals of an ex post estimation. The residuals should be normally distributed and uncorrelated with each other. This is diagnosed by inspecting the autocorrelation and partial autocorrelation functions of the residuals and a normal probability plot, or with the help of the Ljung-Box test [5], implemented in MATLAB as a function ljungbox.
The test statistic is

$$Q = n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^{2}}{n-k},$$

where $n$ is the sample size, $\hat{\rho}_k$ is the sample autocorrelation of the residuals at lag $k$ and $h$ is the number of lags tested. The critical region for rejecting the hypothesis of randomness at significance level $\alpha$ is

$$Q > \chi^{2}_{1-\alpha,\,h},$$

where $\chi^{2}_{1-\alpha,\,h}$ [3] is the $(1-\alpha)$-quantile of the chi-square distribution with $h$ degrees of freedom. In practice, it turns out to be hard to find a SARIMAX-model in MATLAB whose residuals are random according to the Ljung-Box test. The hypothesis of randomness is rejected at a confidence level of at least 0.9.

4 Numerical example

Consider the data shown in Figure 1. The consumption of electricity is nonstationary, and there seems to be a linear trend both in the consumption and in the temperature data. This is tested by estimating the model

$$y_t = \phi\,y_{t-1} + \varepsilon_t,$$

where $y_t$ is the consumption of electricity. The function arfunc gives an estimate of $\phi$ that is very close to one. Differencing the data at lag 1 is therefore justified: the time series is stationary if $|\phi| < 1$, but nonstationary when $\phi = 1$ [1].

Figure 4 shows the power spectrum of the electricity consumption. The x-axis is the frequency of the density spectrum scaled to the interval from 0 to $\pi$. There are spikes at least at frequencies 0.1 and 0.62. This indicates periodicity at lags 24 and 168, because $\frac{0.1}{\pi}\cdot 816 \approx 26$ and $\frac{0.62}{\pi}\cdot 816 \approx 161$, where 816 is the length of the electricity consumption data vector. The values 26 and 161 are approximately 24 and 168 hours, which are reasonable period lengths, namely a day and a week.

Figure 4 Electricity consumption's power spectrum [plot: power against Frequency]

Figure 5 Autocorrelation function of the consumption of electricity differenced by one [plot: AC against Lag Value]

According to the autocorrelation function of the once-differenced electricity consumption data (see Figure 5), there seems to be seasonal behavior in the consumption data with periods of 24 and 168 hours. Consequently these are potential differencing orders. The corresponding models at lags 24 and 168 are estimated to obtain their parameter values.
The differencing order 168 is chosen on the basis of these estimates. Thus, the data is differenced at lags 1 and 168, and from here on "the differenced data" means the data differenced at lags 1 and 168.

Figure 6 Autocorrelation function of the differenced electricity consumption data [plot: AC against Lag Value]

Figure 7 Partial autocorrelation function of the differenced electricity consumption data [plot: PAC against Lag Value]

The autocorrelation function in Figure 6 has two adjacent spikes at lags one and two and a spike at lag 168, which indicates seasonal behavior and an SMA-model. The partial autocorrelation function in Figure 7 likewise has two adjacent spikes at lags one and two, but also seasonal spikes at lag 24 (and multiples of 24) and at lag 168, which indicate seasonal behavior and an SAR-model. A SARIMA-model with terms at these lags is therefore selected.

The external variable, temperature, is likewise differenced at lags 1 and 168. The cross correlation between the differenced series is plotted in Figure 8, and there seems to be correlation between the electricity consumption and the outdoor temperature. The x-axis gives the lag of the input, and the highest spikes occur at around lags 8, -11 and -16. The consumption of electricity is considered to depend on the outdoor temperature (not vice versa), and therefore only positive lag values are taken into consideration.
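The lag convention used above, where a positive lag means that the output responds to an earlier value of the input, can be made concrete with a small Python sketch. The function name cross_correlation, the normalization to the range from -1 to 1 and the impulse-shaped toy signals are assumptions made for illustration; they are not taken from the project's MATLAB code, whose cross correlation appears to be unnormalized.

```python
from math import sqrt


def cross_correlation(u, y, max_lag):
    """Normalized sample cross correlation of y[t] with u[t - k].

    Returns a dict mapping each lag k in -max_lag..max_lag to its value.
    A positive k pairs the output y with an earlier value of the input u.
    """
    n = len(u)
    if len(y) != n:
        raise ValueError("u and y must have the same length")
    mean_u, mean_y = sum(u) / n, sum(y) / n
    du = [v - mean_u for v in u]
    dy = [v - mean_y for v in y]
    scale = sqrt(sum(a * a for a in du) * sum(b * b for b in dy))
    result = {}
    for k in range(-max_lag, max_lag + 1):
        # Keep only the index pairs that fall inside the data.
        total = sum(du[t - k] * dy[t] for t in range(n) if 0 <= t - k < n)
        result[k] = total / scale
    return result


# Hypothetical toy signals: the output reacts 3 steps after the input impulse.
u = [0.0] * 30
y = [0.0] * 30
u[10] = 1.0
y[13] = 2.0
cc = cross_correlation(u, y, 6)
best_lag = max(cc, key=cc.get)
```

In this toy case the largest value occurs at lag 3, the delay built into the signals.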
Figure 8 The cross correlation between differenced electricity consumption and temperature [plot: cross correlation against input lag from -50 to 50]

Figure 9 Standard error estimate of model with different input lags [plot: Std Error Estimate against Input lag]

The standard errors of estimate for different input lags in Figure 9 do not support the cross correlation of the differenced data in Figure 8, because the minimum standard error of estimate (SEE) is achieved with input lag 1 (SEE = 1348.9). The differences in SEE between the lags are small, however. Consequently, the cross correlation function cannot be used in MATLAB as a tool for selecting the most appropriate input lag for the model. To summarize, under these circumstances the SARIMA-model selected above is extended with the temperature input and estimated as a SARIMAX-model.

Figure 10 Ex post forecast of the model [plot: Consumption (kWh) against Time (h); series Estimated and Original]

Figure 11 Ex post forecast of the model [plot: Consumption (kWh) against Time (h) for hours 570 to 640; series Estimated and Original]

Figure 12 Residuals of ex post forecast [plot: Residual (kWh) against Time (h)]

Figure 13 Autocorrelation of residuals [plot: AC against Lag Value]

Figure 14 Normal probability plot of residuals [plot: Probability against Data]

Figures 12, 13 and 14 show that the residuals can be considered white noise: there is no autocorrelation between the residuals, and the normal probability plot forms an approximately straight line. The Ljung-Box test for the residuals gives a value of 4.7419 with 1 degree of freedom.
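To put that value in context, the Ljung-Box statistic and its p-value for one degree of freedom can be sketched in Python. The sketch uses the standard definition of the statistic from [5]; the names ljung_box_q and chi2_sf_one_df and the alternating toy residuals are hypothetical. The p-value formula used here is valid only for one degree of freedom, where the chi-square tail probability reduces to the complementary error function.

```python
from math import erfc, sqrt


def ljung_box_q(residuals, h):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [v - mean for v in residuals]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, h + 1):
        r_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


def chi2_sf_one_df(q):
    """P(X > q) for a chi-square variable with ONE degree of freedom."""
    return erfc(sqrt(q / 2.0))


# p-value of the statistic quoted in the text (4.7419, 1 degree of freedom).
p_value = chi2_sf_one_df(4.7419)

# Hypothetical, strongly autocorrelated toy residuals for comparison.
q_toy = ljung_box_q([1.0, -1.0] * 10, 1)
```

The p-value comes out at roughly 0.03, so the quoted statistic lies between the 0.95 and 0.975 quantiles of the chi-square distribution with one degree of freedom.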
The Ljung-Box test thus indicates that the hypothesis of randomness can be rejected at a confidence level of about 97 %. It turns out to be hard to find a model in MATLAB whose residuals are random according to the Ljung-Box test. Contrary to the Box-Jenkins methodology, we do not return to step one and build a better model, because no other model yields residuals that are random according to the Ljung-Box test. The forecast given by the chosen model is shown in Figure 15.

Figure 15 Ex ante forecast for next 24 hours of electricity consumption [plot: Consumption (kWh) against Time (h); series Estimated]

5 Comparison of MATLAB and SAS software

MATLAB and SAS yield different results and offer different usability. It is hard to say which one is better, because there is no single best model or best usability.

Results

In MATLAB the autocorrelation and partial autocorrelation functions are implemented from their mathematical definitions, and they are similar to the functions computed by SAS software, as expected. The parameter estimates and the standard errors of estimate of the models, however, differ between MATLAB and SAS. Even when the models and the differencing are exactly the same, the estimated parameters differ and naturally produce different standard errors of estimate. The forecasts estimated by MATLAB seem to have lower standard errors of estimate. For example, SAS achieves a standard error of estimate of 2396 with the same model that was chosen above. The differences between MATLAB and SAS result from the estimation algorithms, the initial conditions or the iteration tolerances of the algorithms that estimate the model parameters. SAS automatically produces a lot of information about the distributions of the model parameters and their correlations with each other.
In addition, SAS prints AIC (Akaike's information criterion [6]), SBC (Schwarz's Bayesian information criterion [6]) and the variance of the ex ante estimate, which increases in time. In MATLAB, the user is responsible for producing that information. None of these missing features was programmed in MATLAB during this research project.

In SAS software it is a built-in feature that the chosen input lag value is automatically adjusted to work with the integrated data in an ARMAX-model. In MATLAB, the cross correlation function is not as informative as in SAS. The cross correlation is calculated between the differenced series and no longer has a direct real-world interpretation, whereas in SAS the user observes a cross correlation function that is comparable to the real phenomenon. The cross correlation produced by SAS describes unambiguously the correlation coefficient between the temperature and the electricity consumption, shown in Figure 16. It is negative throughout, because the colder it is, the more electricity is consumed. Moreover, the absolute value of the correlation coefficient is greatest at lag 12, which indicates that the factory retains heat for about twelve hours.

Figure 16 Autocorrelation (upper) and cross correlation (lower) function produced by SAS

Usability

MATLAB lets the user handle all the data as arrays. Thus the user can extract exactly the information that is needed and plot whatever is needed. This makes the modeling process easier, faster and more understandable. Unfortunately, the 95 % confidence level for the ex ante forecast is not computed in MATLAB.

6 Conclusion

A MATLAB implementation of the Box-Jenkins methodology was created, but as the comparison of MATLAB and SAS software shows, the two differ as statistics tools. MATLAB does not yield random residuals according to the Ljung-Box test, and the lag of an external variable works illogically.
MATLAB's System Identification Toolbox is not specifically designed for estimating time series models. Its point of view is different, because the toolbox is intended especially for modeling systems from measured input-output data, as illustrated in Figure 17.

Figure 17 An ARMAX-model structure

From this point of view, the factory would be a system that produces an observable signal, the consumption of electricity. The system is affected by external signals: the outdoor temperature and a disturbance signal, white noise. In this case neither of the signals is controllable. After the data of the system has been observed, the goal is to link the observations together into a dynamic system, which means that the current output value depends not only on the current external stimuli and disturbance but also on their earlier values [6]. Dynamic systems are efficient tools for identifying how the output depends on a certain property of the system; in this case, for example, how the consumption of electricity is affected by the thickness of the walls, the size of the windows or the heating system used to keep the factory warm.

Because of this nature of MATLAB, other possibilities for constructing SARIMAX-models were searched for on the Internet. An ARMA-model can be made with Excel, for example, but it was harder to find a program that is able to construct SARIMAX-models. The R programming language [2] is a free software environment for statistical computing. R is able to compute at least ARIMA-models, but it was not originally designed to handle multivariate models such as ARMAX-models. Its function arima, originally designed to compute ARIMA-models, can be adapted to estimate the parameters of an ARMAX-model. This, however, appears to be even more complicated than in MATLAB. Scilab [16] is free scientific software that is able to estimate an ARMAX-process.
It is in the same position as MATLAB, because its armax function needs to be adapted before it can estimate the parameters of SARIMAX-models. Unfortunately, it seems that only commercial software and the associated toolboxes are able to compute SARIMAX-models and the methods that are needed in time series modeling. All the free programming languages require additional programming before the whole Box-Jenkins methodology can be used.

MATLAB and SAS are not the only commercial software able to compute ARMAX-models. For example, AUTOBOX [19] offers a complete set of Box-Jenkins modeling tools. National Instruments has a product, NI LabVIEW 2009 [4], which offers a tool for estimating the parameters of an ARMAX-model. Additionally, Econometric Software, Inc. [17] offers model frameworks for the Box-Jenkins methodology, and Timberlake Software [18] offers a time series analysis feature that contains an estimator for ARMAX-models.

MATLAB is able to compute SARIMAX-models with the functions of this research project. The results, such as the parameter estimates, differ from those of SAS, but the functions are usable on the course Mat-2.3132 Systems Analysis Laboratory I. SAS prints more information in a shorter time and is obviously better validated than the functions made in this research project. Consequently, some kind of software testing would be needed. No free software for SARIMAX-modeling was found on the Internet. SAS software and the Econometrics Toolbox [8] for MATLAB still seem to be the best alternatives to the solution of this research project, because SAS is already in use, although via an SSH connection, and students are already familiar with MATLAB. The toolbox is intuitively the better option, because MATLAB is already in use and buying a single toolbox is cheaper than buying a new program.

In the future, the functions of this research project could be an appropriate tool for the course Mat-2.3132 Systems Analysis Laboratory I, with a little more work.
The functions need to be validated, and some properties, such as the 95 % confidence levels of the ex ante forecast and some additional statistical data regarding the estimated parameters, could be added to the MATLAB implementation.

Bibliography

[1] Box, G. E. P., & Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day Inc.
[2] Foundation, T. R. (n.d.). The R Project for Statistical Computing. Retrieved September 25, 2009, from http://www.r-project.org
[3] Karr, A. F. (1993). Probability. New York: Springer-Verlag New York, Inc.
[4] LabVIEW, N. (n.d.). The Software that powers virtual instrumentation. Retrieved September 25, 2009, from http://www.ni.com/labview/
[5] Ljung, G. M., & Box, G. E. (1978). On a measure of lack of fit in time series models. Biometrika, 297-303.
[6] Ljung, L. (1987). System Identification Theory For The User. New Jersey: Prentice-Hall, Inc.
[7] Ljung, L., & Glad, T. (1994). Modeling Of Dynamic Systems. New Jersey: Prentice-Hall, Inc.
[8] Mathworks. (n.d.). Econometrics Toolbox Matlab. Retrieved September 30, 2009, from http://www.mathworks.com/products/econometrics/
[9] Mathworks, T. (n.d.). The Mathworks - MATLAB and Simulink for Technical Computing. Retrieved September 25, 2009, from www.mathworks.com
[10] MATLAB. (n.d.). System Identification Toolbox - MATLAB. Retrieved September 29, 2009, from www.mathworks.com/products/sysid/
[11] Milton, S. J., & Arnold, J. C. (2003). Introduction to probability and statistics. McGraw-Hill Companies, Inc.
[12] Noppa. (2009, September 16). Noppa - Työ 2. Retrieved September 25, 2009, from https://noppa.tkk.fi/noppa/kurssi/mat-2.3132/tyo_2
[13] Pakanen, J., & Karjalainen, S. (2002). An ARMAX-model approach for estimating static heat flows in buildings. Espoo: VTT Publications.
[14] Pindyck, R. S., & Rubinfeld, D. L. (1998). Econometric models and economic forecasts. Singapore: McGraw-Hill Book Co.
[15] SAS. (n.d.). SAS Business Analytics and Business Intelligence Software. Retrieved September 25, 2009, from www.sas.com
[16] Scilab. (n.d.). Scilab Home Page. Retrieved September 25, 2009, from http://www.scilab.org
[17] Econometrics Software (n.d.). Program Features - Capabilities - Time Series. Retrieved September 25, 2009, from http://www.limdep.com/features/capabilites/time_series.php
[18] Timberlake Software (n.d.). LIMDEP & NLOGIT. Retrieved September 25, 2009, from http://www.timberlake.co.uk/software/limdep/limdepkey.html
[19] Autobox Systems (n.d.). Autobox Overview. Retrieved September 28, 2009, from www.autobox.com/autobox.htm