Download use of data mining techniques for predicting electric - AGRO

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
TEKA Kom. Mot. i Energ. Roln. – OL PAN, 2011, 11c, 237–245
USE OF DATA MINING TECHNIQUES FOR PREDICTING
ELECTRIC ENERGY DEMAND
Krzysztof Nęcka
Department of Power Engineering and Agricultural Processes Automation,
Agricultural University of Cracow, Balicka Str. 116B, 30-149 Kraków, Poland
e-mail: [email protected]
Summary. The project of prediction models was developed in the Data Miner graphic environment, which allows one to determine the hourly electric energy demand for an agricultural and food industry plant. Whilst using
data mining techniques, it is not necessary to know apriori the form of the theoretical model of the examined phenomenon; it is also not necessary to meet the assumptions, whilst it is possible to model very complex processes.
Key words: data mining, prediction, electric energy, hour request for electric energy.
INTRODUCTION
The famous astronomer Tycho Brache (1546-1601) and his well-known assistant Johannes
Kepler (1571-1630), who was involved both in astronomy and mathematics, are recognized to be
forerunners of the data mining technique [Sokołowski 2005]. Brache tried to discover a mathematical formula that would allow prediction of the location of celestial bodies. However, his works were
not successful. His student, Kepler, described this in the form of three simple equations, currently
known as Kepler’s laws.
Among others, Roy [1957] and Tukey [1962] wrote in the 1960s about the necessity of increasing the dynamics of searching for non-parametric methods, for which it would not be necessary
to meet burdensome assumptions.
Since that time, very dynamic development of data mining techniques has taken place, as well
as many definitions of this issue have been formed and modified over time. Berry, Linoff [1997,
2000] define data mining as a process of examining and analysing large amounts of data in order to
discover meaningful patterns and rules. Hand et al [2001], think that data mining is an analysis of
usually large, previously collected data sets in order to discover new regularities and describe data
in a new way, which is understandable and useful for their owner.
The process of data modelling using data mining proposed in the 1990s by representatives of
three companies (production from Germany, statistical from Great Britain and telecommunication
from Denmark) was adopted as a standard and covers six phases: a) understanding of the problem,
b) recognizing the data, c) preparing the data, d) modelling, e) evaluation of the model, f) implementation of the model.
238
Krzysztof Nęcka
In this technique, it is not necessary to know apriori the form of the theoretical model of the
examined phenomenon. It is also not necessary to meet the assumptions, but this does allow for modelling of very complex processes. The obtained models are usually very complex, and their interpretation is difficult or even impossible for man. They are analysed as black boxes, where input data is
introduced and a correct prediction is expected as the output. They also enable one to apply traditional
models, though usually issues related to the assumptions are analysed less restrictively. The quality
of built models is evaluated mostly by the accuracy of predictions counted in the test set. Because
of this, the techniques may be practically used to solve prediction issues [Demski 2007; Fijołek i in.
2010; Sokołowski 2007; Sokołowski, Pasztyła 2004; Tadeusiewicz 2006; Trojanowska, Nęcka 2009].
Transformation of the power sector into a market oriented economy in Poland, which took
place in 2007 [Dyrektywa 2003/54/WE], caused that electric energy is currently treated as a market product. Due to necessity of balancing demand and supply in real time, this is a very specific
product for which the increase of prediction quality is practical both for its suppliers and receivers.
Accurate evaluation of electric energy demand within individual hours of the day simply translates
into financial results for a company. If the receiver uses any other quantity of energy than planned
and ordered within a given hour, the generated difference must be balanced at the balancing market. However, discrepancies between the energy price in a negotiated contract and at the dynamic
balancing market may reach even one hundred per cent [Ciepiela 2007].
The goal of the paper was to built the most effective prediction system using the data mining
technique that would enable modelling the hourly electric energy demand in agricultural and food
industry plants due to the assumed quality measures.
MATERIAL AND METHODS
The goal of the paper was realized based on the results of own examinations performed using
the AS-3 grid parameter analyser within the period from May 2010 to April 2011. This consisted
in continuous measurement and recording of average 15-minute loads with active power and parameters characterizing the voltage quality in the transformer station supplying the District Dairy
Cooperative. The examined company is located in the southern part of Poland, within the area of
electric energy distribution provided by ENION GRUPA TAURON S.A. It is involved in the production of milk for consumption, dairy drinks, butter, cottage cheese and other dairy products, buying,
for production purposes, from 15 to 17 million litres of milk every year. It is supplied with electric
energy from a transformer and distribution station with 15/0.5kV of rated power 630 kVA, and the
total power of installed electric receivers is 330 kW.
By means of transfer of data from the AS-3 grid parameter analyser to the computer, a database was created in a form that allows its reading and further processing in the Statistica 9.1
program for statistical analysis of data.
EXAMINATION RESULTS
Analysis of load variability.
Realization of the goal of the paper was preceded by an analysis of load variability in the examined plant. It allowed for the collection of general information on the variability of the examined
time series, selection of potential control variables and discovery of the outliers.
USE OF DATA MINING TECHNIQUES FOR PREDICTING
239
At first, an analysis was made of the monthly variability of electric energy demand. The
course of changes in energy consumption within individual months is presented in the Fig. 1.
90
Monthly consumption
of electric energy [MWh]
80
70
60
50
40
30
20
10
2010
Name of month/Year
Aprill
March
February
January
December
November
October
September
August
July
June
May
0
2011
Fig. 1. Monthly consumption of electric energy
From the performed statistical analysis, the results show that monthly consumption of electric
energy changes within a range from ca. 60 MWh to 80 MWh. The calculated value of the correlation
coefficient has shown the existence of a statistically important dependence between the month and
energy consumption. Therefore, the month was accepted as the potential control variable.
The performed analysis of daily energy consumption similarity enabled separation of two
characteristic days of the week, i.e. the working day and the non-working day, showing the highest
similarity of load form [Nęcka 2011 a]. Therefore, it is advisable to divide the time series into two
sets, i.e. working days and non-working days and to create separate prediction models.
Fig. 2 presents the hourly variability of load within a winter and summer month.
Fig. 2. Course of hourly demand for electric energy during the winter month (January)
and the summer month (August)
240
Krzysztof Nęcka
In Fig. 2 as well as in the entire examined period, no outliers were recorded. However, the
seasonality of the electric energy demand is visible. In order to identify the seasonality, an autocorrelation analysis was performed examining the correlation between the values of the time series of
data separated from each other by k points ACF(k)=CF(x,x,k) (Fig. 3).
Fig. 3a presents the strong dependence between current electric energy consumption and its
value one hour earlier. The value of the correlation coefficient for these two measurements is 0.94.
Such a strong dependence between the variables enables one to assume that introduction at the input
of the model of a variable delayed by one hour results in estimates burdened with minimal error.
However, because of the specificity of the Polish energy market, the minimum value of delay that
can be used practically is 41 hours [Gładysz 2009].
a)
Delay Ratio of correlation
b)
Delay Ratio of correlation
Fig. 3. Autocorrelation graph: (a) for k = 24 (b) for k = 200
Fig. 3b presents the course of the autocorrelation function for delay k equal to 200 hours. This
shows sequences repeated in cycles, in which the amplitude decreases very slowly. The period of
the basic cycle is 24 hours. An increase of the correlation coefficient for a delay of 168 hours to the
value of 0.88 is clear, which indicates an occurance not only of daily seasonality, but also weekly.
Hence, it is justified to predict the hourly electric energy demand based on previous observations.
Moreover, results of the performed analyses [Nęcka 2011 b] showed that averaged profiles of
energy consumption within individual days of the week and parameters characterizing the voltage
quality delayed by one or a few seasonal periods are also statistically essential variables influencing
electric energy demand.
Construction of prediction models in the Data Miner operation environment
Before starting the construction of prediction models, the input data was divided to two sets.
The first set, being the teaching one, was created based on 18816 observations recorded within
the first period of examinations. The second set, the so-called test set, was created based on 2976
observations recorded within the last period of performed examinations. The time horizon for all
241
USE OF DATA MINING TECHNIQUES FOR PREDICTING
predictions developed was 48 hours due to the requirements put on participants of the electric
energy market in Poland.
To develop prediction models that allow for the determining of hourly electric energy demand,
a project was generated in the Statistica Data Miner graphic environment, of which an example for
working days is presented in Fig. 4.
Data
sources
Preparation, cleaning,
transformation of data
Data
analysis
Reports
Overall stepwise regression
Standard regression CRT
Exhaustive CHAID for regression
Amplified regression
Remove fixed
variables
Data division
trees
Artificial neuron networks (ANN)
Multivariate Adaptive Regression
p
(
)
Splines
(MARS).
Matched quality factor
Fig. 4. Project of prediction models generated in the Statistica 9.1 Data Miner
graphic environment for working days
The workspace in Statistica 9.1 Data Miner is divided into four sections that correspond to
the four stages of the Data Mining project:
• Data sources: Specification of location of files that include data to be used during the
process of further modelling.
• Preparation, cleaning, transformation of data: The node named Remove fixed variables
allows automatic elimination of input variables that assume one value only. The Data
division block is responsible for dividing data into a teaching and test set based on the
coding variable saved in the prepared input file.
• Data analysis: In the generated template of the design enabling prediction of hourly
electric energy demand, the following models were used: 1 - overall stepwise regression;
2 - standard regression CRT tree; 3 - exhaustive CHAID for regression; 4 - amplified
regression trees; 5 - artificial neuron networks (ANN); 6 - Multivariate Adaptive Regression Splines (MARS). A detailed description of individual models can be found in the
literature [Koronacki, Ćwik 2005; Hastie i in. 2001; Kot, Jakubowski, Sokołowski 2011;
Tadeusiewicz 1993].
Having started the project based on the selected methods, the program runs a proper number
of analyses and generates 5 models that allow for the prediction of hourly electric energy demand.
It also creates a new source of data for each model, which includes the results of its application for
the test set. Rapid quality evaluation of the generated models and their comparison based, among
other things, on the average quarterly error, average absolute error and correlation coefficient is
possible thanks to the Matched quality factor node.
• Reports: After completion of calculations, results summarizing individual analyses in the
form of sheets are placed in the workspace of Data Mining projects.
242
Krzysztof Nęcka
Evaluation of the predicting models
The problem of the quality of the developed prediction is one of the most important issues
related to prediction. Absolute average error, absolute average error modulus, absolute averagequarterly error, relative error of prediction, average relating percentage error, standard deviation of
error, Janus coefficient, Thiel coefficient are commonly used as measures of errors [Cieślak 1999;
Gajda 2001; Trojanowska, Nęcka 2009; Zeliaś i in. 2003].
Within the scope of evaluation of the developed predictions, their accuracy was checked (Tab.
1) by determining the following coefficients counted for the test set:
• Mean Error – ME:
ME =
•
1 n
⋅ ∑ ( Et − Et* ).
n i =1
Mean Absolute Error – MAE:
MAE =
•
(
(2)
Mean Absolute Percentage Error – MAPE:
*
1 n Et − Et
⋅∑
⋅100%.
n i =1 Et
(3)
Standard Deviation of Errors –SDE:
SDE =
•
)
1 n
⋅ ∑ Et − Et* .
n i =1
MAPE =
•
(1)
n
2
1
⋅ ∑ ( Et − Et* ) .
n − 1 i =1
(4)
Ratio of correlation – rEtE:
∑( E
n
rE E* =
t t
t
)(
− Et ⋅ Et* − Et*
i =1
∑( E
n
t
i =1
− Et
) ⋅ ∑(E
2
n
*
t
i =1
)
−E
*
t
)
2
;
where:
Et – actual value of electric energy consumption in hour t,
Et* – predicted value of electric energy consumption in hour t,
Et – mean actual value of electric energy consumption in hour t,
Et* – mean predicted value of electric energy consumption in hour t,
n – number of the last observation of the predicted variable.
(5)
243
USE OF DATA MINING TECHNIQUES FOR PREDICTING
Table 1. Comparison of the quality of predictions generated using the selected methods
Measure of error
Method
ME [kWh] MAE [kWh] MAPE [%]
SDE [kWh] rE E
t
*
t
Overall stepwise regression
-1,03
10,31
14,40
15,07
0,98
Standard regression CRT
-4,79
15,47
19,47
21,85
0,94
Exhaustive CHAID for regression
-2,33
12,02
16,61
17,13
0,97
Amplified regression trees
-2,74
12,29
16,52
17,41
0,97
Artificial neuron networks (ANN);
-1,39
10,98
15,40
16,77
0,97
Multivariate Adaptive Regression Splines
(MARS)
-1,19
10,31
14,23
15,25
0,98
Combined prognosis
-2,24
10,81
13,23
11,34
0,97
110
100
Relative frequency [%]
90
S
Standard
regression CRT Exhaustive
CHAID for regression
Amplified regression trees
ANN
Overall stepwise regression
MARS
Combined prognosis
80
70
60
50
40
30
20
10
0
0
10
20
30
40
50
60
70
80
90
100
MAPE [%]
Fig. 4. Cumulative distribution of prediction model errors for the test set
CONCLUSIONS
The lowest values of coefficients indicating the prediction error for teaching data was obtained in the case of the MARS model; however, the overall stepwise regression made it possible to
generate a prediction of comparable quality. In both cases, whilst predicting hourly electric energy
demand, we are wrong by ca. 14% on average, and the correlation coefficient value is very high
and amounts to 98%.
From the analysis of absolute average error results that all models generated within the examined period of time, overestimated the energy demand by 1 to almost 5 kWh on average. Whilst
predicting demand per a given hour of the day, we are mistaken by up to 10 kWh in the case of the
stepwise regression and MARS model, and by almost 16 kWh in the case of standard CRT trees.
244
Krzysztof Nęcka
One percentage reduction of the average relative error of prediction is obtained for the combined prognosis determined as an average value from individual models. A significant reduction of
errors with higher values is characteristic for this prediction.
In order to perform the analyses one more time, e.g. after a change of parameters of individual
models or after linking new data, it is sufficient to start the previously generated project template,
and the entire calculation process shall be completed automatically.
Monthly consumption of electric energy. Name of month/year. Number of observations. Measure of error. Relative frequency.
REFERENCES
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
Berry M., Linoff G. 1997.: Data Mining Techniques for Marketing, Sales and Customer Support, Wiley, Hoboken, NJ.
Berry M., Linoff G. 2000.: Mastering Data Mining, Wiley, Hoboken, NJ.
Ciepiela D. 2007.: Koszty bilansowania – zmora klienta. [online]. [dostęp 27–01–2011].
Dostępny w internecie: http://energetyka.wnp.pl/tpa/poradnik_jak_zmienic_dostawce_energii/koszty-bilansowania-zmora-klienta,3359_2_0_1.html.
Cieślak M. 1999.: Prognozowanie gospodarcze. Metody i zastosowanie. Wydawnictwo Naukowe PWN. Warszawa.
Demski T. 2007.: Przykład prognozowania z wykorzystaniem metod data mining. [online].
[dostęp 27–01–2011]. Dostępny w internecie: http://www.statsoft.pl/czytelnia/artykuly/
Przyklad_prognozowania.pdf.
Dyrektywa 2003/54/WE Parlamentu Europejskiego i Rady dotycząca wspólnych zasad dla
wewnętrznego rynku energii elektrycznej i uchylenia dyrektywę 96/92/WE oraz rozporządzeni
1228/2003 w sprawie warunków dostępu do sieci w transgranicznej wymianie energii elektrycznej.
Fijorek K., Mróz K., Niedziela K., Fijorek D. 2010.: Prognozowanie cen energii elektrycznej
na rynku dnia następnego metodami data mining, Rynek Energii, nr 12.
Hand D., Mannila H., Smyth P. 2001.: Principles of Data Mining, MIT Press, Cambridge.
Hastie T., Tibshirani R., Friedman J. 2001.: The Elements of Statistical Learning. Data mining, Inference and Prediction. Springer Verlag.
Gładysz B. 2009.: Metoda określania wielkości kontraktów na energię elektryczną. Badania
operacyjne i decyzje. Nr 3. Wrocław.
Gajda J. B. 2001.: Prognozowanie i symulacja a decyzje gospodarcze. Wydawnictwo C.H.
BECK. Warszawa.
Kot S., Jakubowski J., Sokołowski A. 2011.: Statystyka. Wydawnictwo DIFIN. ISBN 97883-7641-349-5.
Nęcka K. 2011a.: Analiza sezonowości obciążeń w zakładzie przemysłu rolno-spożywczego.
(Złożona do druku w Technika Rolnicza Ogrodnicza Leśna).
Nęcka K. 2011b.: Wykorzystanie modeli regresyjnych do prognozowania godzinowego zapotrzebowania na energię elektryczną w zakładzie przemysłu rolno-spożywczego. (Złożona
do druku w Technika Rolnicza Ogrodnicza Leśna).
Roy S. N. 1957.: Some Aspects of Multivariate Analysis, Wiley, New York.
Sokołowski A. 2007.: Data minind – automat czy metoda naukowa ? [online] [dostęp 2404-2011] Dostępny w Internecie: http://www.statsoft.pl/czytelnia/8_2007/Sokolowski05.pdf.
USE OF DATA MINING TECHNIQUES FOR PREDICTING
17.
18.
19.
20.
21.
22.
23.
245
Sokołowski A., Pasztyła A. 2004.: Data mining w prognozowaniu zapotrzebowania na
nośniki energii. [online]. [dostęp 27–04–2011]. Dostępny w Internecie: http://www.statsoft.
pl/czytelnia/prognozowanie/04energia.pdf.
Tadeusiewicz R. 2006.: Data mining jako szansa na relatywnie tanie dokonywanie odkryć
naukowych poprzez przekopywanie pozornie całkowicie wyeksploatowanych danych empirycznych. [online] [dostęp 24-04-2011] Dostępny w Internecie: http://www.statsoft.pl/
czytelnia/8_2007/Tadeusiewicz06.pdf.
Tadeusiewicz R. 1993.: Sieci neuronowe. Akademicka Oficyna Wydawnicza. Warszawa.
Trojanowska M., Nęcka K. 2009.: Wykorzystanie data mining do prognozowania obciążeń
szczytowych wiejskich stacji transformatorowych Мeждyнapoднaя Koнфepeнcия
ЭНEPГOOБECПEЧEНИE И БEЗOПACНOCТЂ. Wydawnictwo Orłowskiego Państwowego
Uniwersytetu Rolniczego OрeлГAУ, s.308-310, Orieł k/Moskwy (Rosja).
Trojanowska M., Nęcka K. 2009.: The use of alternative regression models to determine peak
loads in rural transformer stations. TEKA Komisji Motoryzacji i Energetyki Rolnictwa Vol.
IX. Lublin. s. 364-369.
Tukey J. 1962.: The Future of Statistics, Annals of Mathematical Statistics, 33, 1-67.
Zeliaś A. 1997.: Teoria prognozy. PWN. Warszawa.
WYKORZYSTANIE TECHNIK DATA MINING DO PROGNOZOWANIA
ZAPOTRZEBOWANIA NA ENERGIĘ ELEKTRYCZNĄ
Streszczenie. Opracowano w środowisku graficznym Data Miner projekt modeli predykcyjnych, pozwalających
na wyznaczanie godzinowego zapotrzebowania na energię elektryczną dla zakładu przemysłu rolno-spożywczego. Dzięki wykorzystaniu w nim technik data mining nie jest wymagana znajomość apriori postaci modelu
teoretycznego analizowanego zjawiska, nie jest również konieczne spełnienie założeń i możliwe jest modelowanie bardzo złożonych procesów.
Słowa kluczowe: data mining, prognoza, energia elektryczna, godzinowe zapotrzebowanie na energię.