Download use of data mining techniques for predicting electric - AGRO

TEKA Kom. Mot. i Energ. Roln. – OL PAN, 2011, 11c, 237–245 USE OF DATA MINING TECHNIQUES FOR PREDICTING ELECTRIC ENERGY DEMAND Krzysztof Nęcka Department of Power Engineering and Agricultural Processes Automation, Agricultural University of Cracow, Balicka Str. 116B, 30-149 Kraków, Poland e-mail: [email protected] Summary. The project of prediction models was developed in the Data Miner graphic environment, which allows one to determine the hourly electric energy demand for an agricultural and food industry plant. Whilst using data mining techniques, it is not necessary to know apriori the form of the theoretical model of the examined phenomenon; it is also not necessary to meet the assumptions, whilst it is possible to model very complex processes. Key words: data mining, prediction, electric energy, hour request for electric energy. INTRODUCTION The famous astronomer Tycho Brache (1546-1601) and his well-known assistant Johannes Kepler (1571-1630), who was involved both in astronomy and mathematics, are recognized to be forerunners of the data mining technique [Sokołowski 2005]. Brache tried to discover a mathematical formula that would allow prediction of the location of celestial bodies. However, his works were not successful. His student, Kepler, described this in the form of three simple equations, currently known as Kepler’s laws. Among others, Roy [1957] and Tukey [1962] wrote in the 1960s about the necessity of increasing the dynamics of searching for non-parametric methods, for which it would not be necessary to meet burdensome assumptions. Since that time, very dynamic development of data mining techniques has taken place, as well as many definitions of this issue have been formed and modified over time. Berry, Linoff [1997, 2000] define data mining as a process of examining and analysing large amounts of data in order to discover meaningful patterns and rules. Hand et al [2001], think that data mining is an analysis of usually large, previously collected data sets in order to discover new regularities and describe data in a new way, which is understandable and useful for their owner. The process of data modelling using data mining proposed in the 1990s by representatives of three companies (production from Germany, statistical from Great Britain and telecommunication from Denmark) was adopted as a standard and covers six phases: a) understanding of the problem, b) recognizing the data, c) preparing the data, d) modelling, e) evaluation of the model, f) implementation of the model. 238 Krzysztof Nęcka In this technique, it is not necessary to know apriori the form of the theoretical model of the examined phenomenon. It is also not necessary to meet the assumptions, but this does allow for modelling of very complex processes. The obtained models are usually very complex, and their interpretation is difficult or even impossible for man. They are analysed as black boxes, where input data is introduced and a correct prediction is expected as the output. They also enable one to apply traditional models, though usually issues related to the assumptions are analysed less restrictively. The quality of built models is evaluated mostly by the accuracy of predictions counted in the test set. Because of this, the techniques may be practically used to solve prediction issues [Demski 2007; Fijołek i in. 2010; Sokołowski 2007; Sokołowski, Pasztyła 2004; Tadeusiewicz 2006; Trojanowska, Nęcka 2009]. Transformation of the power sector into a market oriented economy in Poland, which took place in 2007 [Dyrektywa 2003/54/WE], caused that electric energy is currently treated as a market product. Due to necessity of balancing demand and supply in real time, this is a very specific product for which the increase of prediction quality is practical both for its suppliers and receivers. Accurate evaluation of electric energy demand within individual hours of the day simply translates into financial results for a company. If the receiver uses any other quantity of energy than planned and ordered within a given hour, the generated difference must be balanced at the balancing market. However, discrepancies between the energy price in a negotiated contract and at the dynamic balancing market may reach even one hundred per cent [Ciepiela 2007]. The goal of the paper was to built the most effective prediction system using the data mining technique that would enable modelling the hourly electric energy demand in agricultural and food industry plants due to the assumed quality measures. MATERIAL AND METHODS The goal of the paper was realized based on the results of own examinations performed using the AS-3 grid parameter analyser within the period from May 2010 to April 2011. This consisted in continuous measurement and recording of average 15-minute loads with active power and parameters characterizing the voltage quality in the transformer station supplying the District Dairy Cooperative. The examined company is located in the southern part of Poland, within the area of electric energy distribution provided by ENION GRUPA TAURON S.A. It is involved in the production of milk for consumption, dairy drinks, butter, cottage cheese and other dairy products, buying, for production purposes, from 15 to 17 million litres of milk every year. It is supplied with electric energy from a transformer and distribution station with 15/0.5kV of rated power 630 kVA, and the total power of installed electric receivers is 330 kW. By means of transfer of data from the AS-3 grid parameter analyser to the computer, a database was created in a form that allows its reading and further processing in the Statistica 9.1 program for statistical analysis of data. EXAMINATION RESULTS Analysis of load variability. Realization of the goal of the paper was preceded by an analysis of load variability in the examined plant. It allowed for the collection of general information on the variability of the examined time series, selection of potential control variables and discovery of the outliers. USE OF DATA MINING TECHNIQUES FOR PREDICTING 239 At first, an analysis was made of the monthly variability of electric energy demand. The course of changes in energy consumption within individual months is presented in the Fig. 1. 90 Monthly consumption of electric energy [MWh] 80 70 60 50 40 30 20 10 2010 Name of month/Year Aprill March February January December November October September August July June May 0 2011 Fig. 1. Monthly consumption of electric energy From the performed statistical analysis, the results show that monthly consumption of electric energy changes within a range from ca. 60 MWh to 80 MWh. The calculated value of the correlation coefficient has shown the existence of a statistically important dependence between the month and energy consumption. Therefore, the month was accepted as the potential control variable. The performed analysis of daily energy consumption similarity enabled separation of two characteristic days of the week, i.e. the working day and the non-working day, showing the highest similarity of load form [Nęcka 2011 a]. Therefore, it is advisable to divide the time series into two sets, i.e. working days and non-working days and to create separate prediction models. Fig. 2 presents the hourly variability of load within a winter and summer month. Fig. 2. Course of hourly demand for electric energy during the winter month (January) and the summer month (August) 240 Krzysztof Nęcka In Fig. 2 as well as in the entire examined period, no outliers were recorded. However, the seasonality of the electric energy demand is visible. In order to identify the seasonality, an autocorrelation analysis was performed examining the correlation between the values of the time series of data separated from each other by k points ACF(k)=CF(x,x,k) (Fig. 3). Fig. 3a presents the strong dependence between current electric energy consumption and its value one hour earlier. The value of the correlation coefficient for these two measurements is 0.94. Such a strong dependence between the variables enables one to assume that introduction at the input of the model of a variable delayed by one hour results in estimates burdened with minimal error. However, because of the specificity of the Polish energy market, the minimum value of delay that can be used practically is 41 hours [Gładysz 2009]. a) Delay Ratio of correlation b) Delay Ratio of correlation Fig. 3. Autocorrelation graph: (a) for k = 24 (b) for k = 200 Fig. 3b presents the course of the autocorrelation function for delay k equal to 200 hours. This shows sequences repeated in cycles, in which the amplitude decreases very slowly. The period of the basic cycle is 24 hours. An increase of the correlation coefficient for a delay of 168 hours to the value of 0.88 is clear, which indicates an occurance not only of daily seasonality, but also weekly. Hence, it is justified to predict the hourly electric energy demand based on previous observations. Moreover, results of the performed analyses [Nęcka 2011 b] showed that averaged profiles of energy consumption within individual days of the week and parameters characterizing the voltage quality delayed by one or a few seasonal periods are also statistically essential variables influencing electric energy demand. Construction of prediction models in the Data Miner operation environment Before starting the construction of prediction models, the input data was divided to two sets. The first set, being the teaching one, was created based on 18816 observations recorded within the first period of examinations. The second set, the so-called test set, was created based on 2976 observations recorded within the last period of performed examinations. The time horizon for all 241 USE OF DATA MINING TECHNIQUES FOR PREDICTING predictions developed was 48 hours due to the requirements put on participants of the electric energy market in Poland. To develop prediction models that allow for the determining of hourly electric energy demand, a project was generated in the Statistica Data Miner graphic environment, of which an example for working days is presented in Fig. 4. Data sources Preparation, cleaning, transformation of data Data analysis Reports Overall stepwise regression Standard regression CRT Exhaustive CHAID for regression Amplified regression Remove fixed variables Data division trees Artificial neuron networks (ANN) Multivariate Adaptive Regression p ( ) Splines (MARS). Matched quality factor Fig. 4. Project of prediction models generated in the Statistica 9.1 Data Miner graphic environment for working days The workspace in Statistica 9.1 Data Miner is divided into four sections that correspond to the four stages of the Data Mining project: • Data sources: Specification of location of files that include data to be used during the process of further modelling. • Preparation, cleaning, transformation of data: The node named Remove fixed variables allows automatic elimination of input variables that assume one value only. The Data division block is responsible for dividing data into a teaching and test set based on the coding variable saved in the prepared input file. • Data analysis: In the generated template of the design enabling prediction of hourly electric energy demand, the following models were used: 1 - overall stepwise regression; 2 - standard regression CRT tree; 3 - exhaustive CHAID for regression; 4 - amplified regression trees; 5 - artificial neuron networks (ANN); 6 - Multivariate Adaptive Regression Splines (MARS). A detailed description of individual models can be found in the literature [Koronacki, Ćwik 2005; Hastie i in. 2001; Kot, Jakubowski, Sokołowski 2011; Tadeusiewicz 1993]. Having started the project based on the selected methods, the program runs a proper number of analyses and generates 5 models that allow for the prediction of hourly electric energy demand. It also creates a new source of data for each model, which includes the results of its application for the test set. Rapid quality evaluation of the generated models and their comparison based, among other things, on the average quarterly error, average absolute error and correlation coefficient is possible thanks to the Matched quality factor node. • Reports: After completion of calculations, results summarizing individual analyses in the form of sheets are placed in the workspace of Data Mining projects. 242 Krzysztof Nęcka Evaluation of the predicting models The problem of the quality of the developed prediction is one of the most important issues related to prediction. Absolute average error, absolute average error modulus, absolute averagequarterly error, relative error of prediction, average relating percentage error, standard deviation of error, Janus coefficient, Thiel coefficient are commonly used as measures of errors [Cieślak 1999; Gajda 2001; Trojanowska, Nęcka 2009; Zeliaś i in. 2003]. Within the scope of evaluation of the developed predictions, their accuracy was checked (Tab. 1) by determining the following coefficients counted for the test set: • Mean Error – ME: ME = • 1 n ⋅ ∑ ( Et − Et* ). n i =1 Mean Absolute Error – MAE: MAE = • ( (2) Mean Absolute Percentage Error – MAPE: * 1 n Et − Et ⋅∑ ⋅100%. n i =1 Et (3) Standard Deviation of Errors –SDE: SDE = • ) 1 n ⋅ ∑ Et − Et* . n i =1 MAPE = • (1) n 2 1 ⋅ ∑ ( Et − Et* ) . n − 1 i =1 (4) Ratio of correlation – rEtE: ∑( E n rE E* = t t t )( − Et ⋅ Et* − Et* i =1 ∑( E n t i =1 − Et ) ⋅ ∑(E 2 n * t i =1 ) −E * t ) 2 ; where: Et – actual value of electric energy consumption in hour t, Et* – predicted value of electric energy consumption in hour t, Et – mean actual value of electric energy consumption in hour t, Et* – mean predicted value of electric energy consumption in hour t, n – number of the last observation of the predicted variable. (5) 243 USE OF DATA MINING TECHNIQUES FOR PREDICTING Table 1. Comparison of the quality of predictions generated using the selected methods Measure of error Method ME [kWh] MAE [kWh] MAPE [%] SDE [kWh] rE E t * t Overall stepwise regression -1,03 10,31 14,40 15,07 0,98 Standard regression CRT -4,79 15,47 19,47 21,85 0,94 Exhaustive CHAID for regression -2,33 12,02 16,61 17,13 0,97 Amplified regression trees -2,74 12,29 16,52 17,41 0,97 Artificial neuron networks (ANN); -1,39 10,98 15,40 16,77 0,97 Multivariate Adaptive Regression Splines (MARS) -1,19 10,31 14,23 15,25 0,98 Combined prognosis -2,24 10,81 13,23 11,34 0,97 110 100 Relative frequency [%] 90 S Standard regression CRT Exhaustive CHAID for regression Amplified regression trees ANN Overall stepwise regression MARS Combined prognosis 80 70 60 50 40 30 20 10 0 0 10 20 30 40 50 60 70 80 90 100 MAPE [%] Fig. 4. Cumulative distribution of prediction model errors for the test set CONCLUSIONS The lowest values of coefficients indicating the prediction error for teaching data was obtained in the case of the MARS model; however, the overall stepwise regression made it possible to generate a prediction of comparable quality. In both cases, whilst predicting hourly electric energy demand, we are wrong by ca. 14% on average, and the correlation coefficient value is very high and amounts to 98%. From the analysis of absolute average error results that all models generated within the examined period of time, overestimated the energy demand by 1 to almost 5 kWh on average. Whilst predicting demand per a given hour of the day, we are mistaken by up to 10 kWh in the case of the stepwise regression and MARS model, and by almost 16 kWh in the case of standard CRT trees. 244 Krzysztof Nęcka One percentage reduction of the average relative error of prediction is obtained for the combined prognosis determined as an average value from individual models. A significant reduction of errors with higher values is characteristic for this prediction. In order to perform the analyses one more time, e.g. after a change of parameters of individual models or after linking new data, it is sufficient to start the previously generated project template, and the entire calculation process shall be completed automatically. Monthly consumption of electric energy. Name of month/year. Number of observations. Measure of error. Relative frequency. REFERENCES 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. Berry M., Linoff G. 1997.: Data Mining Techniques for Marketing, Sales and Customer Support, Wiley, Hoboken, NJ. Berry M., Linoff G. 2000.: Mastering Data Mining, Wiley, Hoboken, NJ. Ciepiela D. 2007.: Koszty bilansowania – zmora klienta. [online]. [dostęp 27–01–2011]. Dostępny w internecie: http://energetyka.wnp.pl/tpa/poradnik_jak_zmienic_dostawce_energii/koszty-bilansowania-zmora-klienta,3359_2_0_1.html. Cieślak M. 1999.: Prognozowanie gospodarcze. Metody i zastosowanie. Wydawnictwo Naukowe PWN. Warszawa. Demski T. 2007.: Przykład prognozowania z wykorzystaniem metod data mining. [online]. [dostęp 27–01–2011]. Dostępny w internecie: http://www.statsoft.pl/czytelnia/artykuly/ Przyklad_prognozowania.pdf. Dyrektywa 2003/54/WE Parlamentu Europejskiego i Rady dotycząca wspólnych zasad dla wewnętrznego rynku energii elektrycznej i uchylenia dyrektywę 96/92/WE oraz rozporządzeni 1228/2003 w sprawie warunków dostępu do sieci w transgranicznej wymianie energii elektrycznej. Fijorek K., Mróz K., Niedziela K., Fijorek D. 2010.: Prognozowanie cen energii elektrycznej na rynku dnia następnego metodami data mining, Rynek Energii, nr 12. Hand D., Mannila H., Smyth P. 2001.: Principles of Data Mining, MIT Press, Cambridge. Hastie T., Tibshirani R., Friedman J. 2001.: The Elements of Statistical Learning. Data mining, Inference and Prediction. Springer Verlag. Gładysz B. 2009.: Metoda określania wielkości kontraktów na energię elektryczną. Badania operacyjne i decyzje. Nr 3. Wrocław. Gajda J. B. 2001.: Prognozowanie i symulacja a decyzje gospodarcze. Wydawnictwo C.H. BECK. Warszawa. Kot S., Jakubowski J., Sokołowski A. 2011.: Statystyka. Wydawnictwo DIFIN. ISBN 97883-7641-349-5. Nęcka K. 2011a.: Analiza sezonowości obciążeń w zakładzie przemysłu rolno-spożywczego. (Złożona do druku w Technika Rolnicza Ogrodnicza Leśna). Nęcka K. 2011b.: Wykorzystanie modeli regresyjnych do prognozowania godzinowego zapotrzebowania na energię elektryczną w zakładzie przemysłu rolno-spożywczego. (Złożona do druku w Technika Rolnicza Ogrodnicza Leśna). Roy S. N. 1957.: Some Aspects of Multivariate Analysis, Wiley, New York. Sokołowski A. 2007.: Data minind – automat czy metoda naukowa ? [online] [dostęp 2404-2011] Dostępny w Internecie: http://www.statsoft.pl/czytelnia/8_2007/Sokolowski05.pdf. USE OF DATA MINING TECHNIQUES FOR PREDICTING 17. 18. 19. 20. 21. 22. 23. 245 Sokołowski A., Pasztyła A. 2004.: Data mining w prognozowaniu zapotrzebowania na nośniki energii. [online]. [dostęp 27–04–2011]. Dostępny w Internecie: http://www.statsoft. pl/czytelnia/prognozowanie/04energia.pdf. Tadeusiewicz R. 2006.: Data mining jako szansa na relatywnie tanie dokonywanie odkryć naukowych poprzez przekopywanie pozornie całkowicie wyeksploatowanych danych empirycznych. [online] [dostęp 24-04-2011] Dostępny w Internecie: http://www.statsoft.pl/ czytelnia/8_2007/Tadeusiewicz06.pdf. Tadeusiewicz R. 1993.: Sieci neuronowe. Akademicka Oficyna Wydawnicza. Warszawa. Trojanowska M., Nęcka K. 2009.: Wykorzystanie data mining do prognozowania obciążeń szczytowych wiejskich stacji transformatorowych Мeждyнapoднaя Koнфepeнcия ЭНEPГOOБECПEЧEНИE И БEЗOПACНOCТЂ. Wydawnictwo Orłowskiego Państwowego Uniwersytetu Rolniczego OрeлГAУ, s.308-310, Orieł k/Moskwy (Rosja). Trojanowska M., Nęcka K. 2009.: The use of alternative regression models to determine peak loads in rural transformer stations. TEKA Komisji Motoryzacji i Energetyki Rolnictwa Vol. IX. Lublin. s. 364-369. Tukey J. 1962.: The Future of Statistics, Annals of Mathematical Statistics, 33, 1-67. Zeliaś A. 1997.: Teoria prognozy. PWN. Warszawa. WYKORZYSTANIE TECHNIK DATA MINING DO PROGNOZOWANIA ZAPOTRZEBOWANIA NA ENERGIĘ ELEKTRYCZNĄ Streszczenie. Opracowano w środowisku graficznym Data Miner projekt modeli predykcyjnych, pozwalających na wyznaczanie godzinowego zapotrzebowania na energię elektryczną dla zakładu przemysłu rolno-spożywczego. Dzięki wykorzystaniu w nim technik data mining nie jest wymagana znajomość apriori postaci modelu teoretycznego analizowanego zjawiska, nie jest również konieczne spełnienie założeń i możliwe jest modelowanie bardzo złożonych procesów. Słowa kluczowe: data mining, prognoza, energia elektryczna, godzinowe zapotrzebowanie na energię.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download use of data mining techniques for predicting electric - AGRO