Download Data Mining Applied on Grain Data Mart FE Correa1a, MDB

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Principal component analysis wikipedia , lookup

Transcript
Data Mining Applied on Grain Data Mart
F. E. Correa1a, M. D. B. Oliveira2a, L. R. A. Alves3, J. Gama2b and P. L. P. Corrêa1b
1
Agricultural Automation Laboratory - Department of Computer Engineering - Polytechnic
School – University of São Paulo, São Paulo, SP. POBox: 61548. ZIP 05424-970 - Brazil.
2
Laboratory of Artificial Intelligence and Decision Support – Inesc Porto LA, University of Porto,
R. de Ceuta, 118, 6., ZIP: 4050-190, Porto, Portugal.
3
Center for Advanced Studies on Applied Economics, School of Agriculture “Luiz de Queiroz” –
University of São Paulo, Piracicaba, SP. ZIP: 13400-970 - Brazil.
1a
e-mail: [email protected]
1b
e-mail: [email protected]
1a
e-mail: [email protected]
1b
e-mail: [email protected]
3
e-mail: [email protected]
Abstract
Agribusiness, as many other activities, produces huge amounts of spatio-temporal data. We
need a system in order to store, analyze, and mine this data. In a previous work, we developed
data warehouse tools to store, organize and query Brazilian agribusiness data from several
regions along 10 years. In this paper, we go a step ahead, and propose specific data mining
techniques to discover marks and evolution patterns from Agribusiness data. We propose the
use of Tucker decomposition to automatically detect short time windows that exhibit large
changes in the correlation structure between the time-series of prices from the Brazil Grain
market.
Keywords: Data mining, Tucker Decomposition, Correlation, Grain Market
Introduction
Agribusiness activity has the greatest influence in Brazil’s Economy and knowledge is
necessary for an efficient administration to maintain competitiveness. Therefore, it is
necessary to increase the number of researches and develop technology and innovation to
extract information from Agribusiness data, which is more complex to analyze due to its
specificities and large number of variables (or attributes). As a concern we can say that each
segment in agribusiness has sufficient variables and data to make the work hard (King, 2009).
Universities, research centers and some of Brazil’s governmental divisions have collected and
maintained agribusiness data information for 20 years. This data can be used to understand
the evolution of agribusiness sectors. These types of data have a lot of variables regarding time,
because the databases are increasing so fast (IBGE, 2010).
A few years ago, the problem was related to the lack of data to analyze. In contrast, nowadays
the high volume of information turns its analysis into a time-consuming and cumbersome
task. An example of this are the databases from Center of Advanced Studies on Applied
518
EFITA/WCCA ’11
Economics – Cepea which started to collect and maintain data about agri-economics of Brazil
(Cepea, 2010).
Following this line, Correa (2009) created an environment using Cepea databases and made
possible to create models and tools for developed Data Marts toward Brazil Agribusiness. This
environment also provides an easier access to the information by the specialists, due to the way
data is organized. However, it is still difficult to select which part of the data can explain some
events, or find patterns in a larger database, without resorting to automatic tools.
Moreover, the information that is stored is time-series data, and agriculture has seasonal
plantation crops or harvests, whose fluctuations are reflected on the collected data (e.g., prices
variability within a year). Another aspect that could be considered is the introduction of
geographic information related to the position of the agriculture’s regions. This is not only
pertinent in the discovery of relevant data, such as costs of transportation, but also in the
understanding of the influence of several variables, such as the distances between the regions,
in the product prices negotiated on them.
Bearing this in mind, we started a research using Data Mining techniques to explore and
discover events in time-series and geographic data about agribusiness.
This paper aims to show some preliminary results and compare them with facts that happened
in the past. Our hypothesis is that the analysis of Tucker decompositions and correlations can
help us in the automatic detection of events on the real world data. The Data Mart used is
Brazil’s agri-economics, more specifically, the Grain Data Mart.
Concepts and methods
In this chapter some information about Grain Data Mart, Tucker decomposition, correlation
and the adopted process are described.
Grain Data Mart:
The amount of data used in this work was obtained using OLAP tools. It is important to
consider in a Data Mining process the use of data from a Data Warehouse/Data Mart in order
to avoid problems, such as duplications or inconsistent data (Correa, 2009).
For a better comprehension a short description of Data Mart variables is presented:
• Products: not processed soybean and corn (commodities), negotiated per bag of 60 kilos.
• Prices: are the average prices, expressed in Brazilian currency – real -, per bag of 60 kilos
(R$-Sc/60 kg), practiced in negotiation (buy and sell) of market players for products into
region reference, in producer and wholesaler (between companies) levels.
• Producers’ prices: commercialization prices to producers, for products stored in
warehouses of the regions where they were produced.
• Wholesale prices: commercialization prices of the final product for the negotiation that is
established between companies; these prices are clean, dry and standardized (export
standard), and do not include Brazilian taxes.
EFITA/WCCA ’11
519
•
Region: group of cities where the product is produced and negotiated; a standard price is
considered, which means it is the same for different cities.
The Grain Data Mart contains 7 products, 15 regions, 5 measures and has more than 150
thousand records. The granularity of the data is divided in a hierarchy of regions (e.g. cities,
regions and state) and time (e.g. day, month, quarter, semester and year).
In this paper we only mine part of the Data Mart: we analyze the regions that have prices
available for soybean and corn (7 regions) and we select the producers’ prices and wholesale
prices, for each grain, as our measures (4 measures).
In
Figure there are depicted the studied regions, which are: Triângulo Mineiro (MG), Rio Verde
(GO), Mogiana (SP), Sorocabana (SP), Norte do Paraná (PR), Passo Fundo (RS) and Sorriso
(MT). The data is analyzed for a period that goes from Jan/2006 to Dec/2009. The chosen
granularities are day, month and quarter.
Figure 1. Geographical location of the Regions analyzed in this paper.
Tucker Decomposition
Tucker decomposition is a data analysis tool that is quite useful for data cleaning, data
compression and visualization of the main structure of data in low-dimensional spaces.
Tucker (Tucker, 1966) devised this method in order to extend the well-known PCA (Principal
Component Analysis) to higher-order data representations, such as tensors. We can
straightforward define a tensor as an extension of a matrix to three or more dimensions, or as
a !-way data array, where ! is the order of the tensor. The Data Mart analyzed in this paper
520
EFITA/WCCA ’11
can be easily organized as a three-order tensor, since it incorporates the time dimension. The
order, ways or modes of a tensor are synonyms and refer to the number of dimensions (in our
case, we have three dimensions: regions, measures and time). For this specific type of tensors
or, in other words, !-way data arrays for ! ! !, the most appropriate Tucker decomposition
model is the so-called Tucker3 tensor decomposition (Tucker, 1966), which performs the
reduction of data in all three modes of the tensor.
The basic idea of the Tucker3 decomposition is to find a set of matrices and a small tensor
that, in general, have less dimensionality than the original tensor, but are able to reconstruct
the most important information contained in data.
The Tucker3 model can be formulated as the factorization of the original three-order tensor !,
such that
!
!
!
!!"# !
!!"# !!" !!" !!"
!!! !!! !!!
for ! ! !! ! ! !, ! ! !! ! ! ! and ! ! !! ! ! !. Here, the coefficients !!" , !!" and !!" represent
the entries of the component matrices ! ! ! !!!! , ! ! ! ! !!! and ! ! ! !!!! . In turn, the
coefficient !!"# represents the entry of the so-called core tensor ! ! ! !!!!!! . The number of
entities (i.e, the number of rows) in each mode are represented by letters !! ! and !. The
number of components (i.e., the number of columns of the matrices !, ! and !) in the first,
second and third mode of the tensor are represented by letters !! ! and !, respectively. The
basic decomposition of three-order tensor, using the Tucker technique, is depicted in Figure 1.
Figure 2. The basic Tucker3 tensor decomposition (Kolda and Bade, 2009)
Tucker suggested interpreting the core tensor as describing the latent structure in data and the
component matrices as mixing this structure to give the observed data. The core tensor can also
be interpreted as a generalization of the eigenvalues of the SVD (Singular Value
Decomposition), and it constitutes a further partitioning of the explained variation as is
indicated by the eigenvalues of the standard PCA. Detailed information about Tucker3
technique can be found Tucker (1966), Skillicorn (2007) and Kolda et al. (2009).
We resort to this technique in order to detect abnormal events and important marks in
agribusiness data, for the considered time horizon.
EFITA/WCCA ’11
521
Correlation
Correlation is a measure of the relationship between pairs of entities and its value indicate the
strength and direction of its linear relationship. Typically, this dependence between entities is
measured using the Pearson’s correlation coefficient, which is obtained dividing the covariance
of the two entities (usually, variables) by the product of their standard deviations and is
denoted by !.
When the values of both entities vary in the same direction (e.g., if ! increases, then !
increases, and vice versa), we say that entities ! and ! are positively correlated (! ! !). The
strength of this relationship is defined by the magnitude of !, which reaches its maximum
when ! ! ! – perfect positive correlation. On the other hand, when the values of both
entities vary in opposite directions (for instance, when ! increases, then ! decreases), the
entities are negatively correlated (! ! !).
Similarly, the correlation coefficient attains its minimum at !!, which indicates a perfect
negative correlation. It should be noted that a zero value of the Pearson’s correlation
coefficient (! ! !) does not imply an absence of relationship between both entities. This
outcome can only indicate the inexistence of a linear relationship between the analyzed entities
(Hoffmann et al., 1977).
In this paper, we will use the analysis of correlation as a complementary technique of Tucker3
tensor decomposition results. The reason behind this is the fact that Tucker3 only allow the
detection of irregularities in data, and does not provide deeper information about the causes of
abnormal behavior, which can be explored in practice computing correlations.
Application process
As mentioned before, the chosen Data Mart is composed by 7 regions, 4 measures (averages
prices of soybean and corn) and a total of 4 years. Time was split in blocks, more specifically,
in 16 blocks (every quarter) and 4 blocks (each year), in order to apply Tucker3 techniques.
To conduct the experiments described in this paper we resort to the N-way toolbox for
MatLab. We developed a batch to automatic save the results and the plots returned by the
functions and we got 20 results for analysis.
In order to apply the Tucker3 decomposition first was necessary to define the modes that
would be used. We choose the regions, measures and time as the modes of a three-order
tensor.
The outputs are presented in plots defined by the previously mentioned modes. To help the
interpretation of the results we compute the correlations for every quarter and for every
combination of pairs of regions, fixing a given measure and a given product. For instance, we
compute the correlation between the pair of regions !"##$%"!!"#!!"#$", regarding the price
of the corn; then we compute again the correlation between another pair of regions (e.g
!"##$%"!!"#$%!!"#"$!) for the same measure, and so on.
522
EFITA/WCCA ’11
Results
In order to help the decision of which plot to present, we compared the results returned for the
analysis using the quarter time blocks and the year time blocks. We concluded that both
analyses obtained very close results. Therefore, in this study we only present plots regarding
the years (results returned by applying Tucker3 decomposition to annual data).
To understand and interpret the results the first step is to analyze the variation explained by
each combination of components (components are the columns of each component matrix)
that are shown in Table 1. This table comprises information regarding the year and the
combination of components (or factors) that explain the most variation of the original tensor.
Table 1. Combination of components with higher explanatory power, for each analyzed year
Each number of a specific combination of the column Factor refers to the number of the
component (or column) of each component matrix, which, in turn, has compressed
information about each mode of the original tensor.(regions, measures and time). To give an
example: for the time block of year 2008, the combination (1,1,1), ie, the first component of
mode regions, the first component of mode measures and the first component of mode time,
explains 55.88% of the variation contained in the original data mart.
The combinations of components with higher percentage are the most relevant ones, so the
information provided by the Table give us an idea where we should focus our analysis.
Another output of the Tucker3 technique is the component matrices and the core tensor. The
number of rows of each component matrix corresponds to the number of regions, and the
number of columns represent the number of components (for the sake of interpretation, we
can make an analogy to the principal components returned by PCA), that can vary for each
mode. Each entry of the matrix has associated a score, which value is within the interval
!!!!!!. In order to find interesting events, we look for regions with scores closer to extreme
values (-1 or 1). Scores closer to zero are less interesting since they represent stable data.
We graphically represent the component matrices in a way that we can easily detect extreme
scores. In X-axis are represented the ID of each object in a mode (e.g., in mode regions, the Xaxis corresponds to the ID of regions) and in Y-axis the scores are represented. In
Figure the contents of the component matrix of mode regions are illustrated. Each subplot
corresponds to a one of the four blocks of time (one per year from 2006 to 2009). The labels of
the regions in the X-axis are: 1 - Triangulo Mineiro (MG); 2 – Rio Verde (GO); 3 – Mogiana
(SP); 4 – Norte do Paraná (PR); 5 – Passo Fundo (RS); 6 – Sorocabana (SP), and 7 – Sorriso
(MT).
EFITA/WCCA ’11
523
(a) Regions plot.
(b) Years (time) plot.
Figure 3. Component Matrices plot.
The plots are highlighted with a circle to draw attention for extremes scores that identify the
interesting events.
Proceeding the analysis for the 1st factor (blue spots), we detect the following events: in year
2006 Sorriso region is associated with a high negative score (score:-0.8); in 2007 this region
presents a score of -0.6, which is not as high as the previous one but still can be considered an
extreme score. Regarding the mode measures, the results follow the same tendency of Sorriso
Region for all analyzed years. It is noteworthy that, during years 2008 and 2009, the scores for
Sorriso region invert (from negative to positive) and the values became close to 1.
On the other hand, the remaining six regions have associated opposite scores or values very
close to 0. Based in this informatio it can be inferred that Sorriso region has opposite behavior
compared to the behavior of all other regions.
Similarly to
Figure -a,
Figure -b illustrates the scores for each component of mode time (represented in blocks of
years). The analysis procedure is the same as the one adopted in the previous mode.
It is possible to see that, on October and December of 2007, and for the 1st component (blue
line), the scores of time present a descending trend (the scores became even more lower
through time). Regarding the 3rd component (red line), this one is characterized by a great
variability along different months. However, since this component has a low
representativeness of the original data (1.78%), as can be verified in Table 1, won’t be
considered for further analysis.
In contrast with 2007, and for the same months described before, the 1st component (blue line)
of year 2008 presents a significant change to positive scores and the 2nd component (green
line) of year 2008 a remarkable change to negative scores.
524
EFITA/WCCA ’11
During 2009 the 1st component is maintained stable (scores close to 0), so we do not consider
it interesting within the purpose of our analysis. However, the 2nd component (green line)
presents positive scores for May/June and negative scores in the end-of-the-year (November
and December).
As a complement of the Extreme Scores Analysis of Tucker3 tensor decomposition, we
compute the correlations using the procedure explained in the subsection Application Process.
These will be computed for each quarter of years 2007 and 2009, which are the years that
present special behavior according to the previous findings. We focus on quarters so we can
have access to more granular information.
Since Tucker3 plots for mode regions highlighted Sorriso, this region was fixed for the
computation of correlations (combinations of region Sorriso with all other regions) for the
corn product prices. The resulting plots are depicted in
Figure -a. the graphical analysis allows us to draw some interesting facts, namely:, the fact that
Sorriso region has negative correlation on the first semester of 2007 and positive correlation
on the last semester of this year.
The importance of this event is to infer that, on the first semester, almost every regions had an
opposite trend comparing with the Sorriso region. This means that, for instance, when prices
of Sorocabana region (6) increase the prices of Sorriso region (7) decrease or vice versa.
However, during the second semester the Sorriso region presents the same behavior as other
regions or, in other words, if the prices of one of the regions decrease, then the prices of
Sorriso region also decrease (the opposite is also true).
(a) Correlation for 2007.
(b) Correlation for 2009.
Figure 4. Correlation between Sorriso region and all other regions, with respect to Corn prices
For the plot of mode time (years), returned by Tucker3 decomposition and presented in
Figure -b, May and June of year 2009 were highlighted. Performing the same tests based on
correlations, that were described earlier, and presented in
Figure we conclude that only in this quarter the correlation is negative.
The information provided by both analysis – Extreme Scores Analysis based on Tucker3 tensor
decomposition and Correlation analysis - unveiled interesting facts about Brazilian
Agribusiness. First of all, it is known by the domain experts, that these two products - Soybean
and Corn - compete for crop fields in Brazil. To have an idea, the total grain production for
EFITA/WCCA ’11
525
Mato Grosso State and for crop year 2007/2008, was more than 25 million ton. (17 million for
Soybean and 8 million ton for Corn). Sorriso Region is the biggest producer of Soybean (the
production for year 2009 was R$ 1.33 billion, or U.S. $775 million dollars), (IBGE, 2010).
The second interesting fact detected by the performed analysis, is related to the Corn
production from 2006 until the first semester of 2007, which was unexpressive. Because of that
the prices didn’t follow the variations of the other regions showed in this paper. However,
during the second semester of 2007 Mato Grosso State (that include Sorriso region) had an
increase of Corn production, for 4 million ton in 2005/2006 to 7.8 million ton on last semester
in 2007/2008, which turned this State to an important producer of Corn and, consequently, as
a dictator of price trends to other regions (Conab, 2010).
The third important event that was captured using these techniques is related to the fact that,
during the first semester of 2009, the Corn prices were unstable; nevertheless, the prices in
Sorriso region became more stable. This is due to the fact that, in general, the volume of crop
fields and their prices are more stable on the second quarter, in Sorriso region, than in the
other regions, for the same period.
Conclusions
In this paper we propose the use of two complementary techniques – Tucker3 tensor
decomposition and Correlation Analysis – as means of gaining insight about abnormal and
important events related to the Brazilian Agribusiness. The proposed process to analyze large
amounts of data that arise from Agriculture activities, allowed us to draw some interesting
conclusions that, otherwise, would be hard to find. Through the detection of irregularities in
data, these techniques can help the specialists focus and concentrate efforts in specific events
occurred in the past. These are, in general, the representative events that Tucker3 technique
highlights. Therefore, the specialists free more of their time in the understanding and
characterization of the evolution patterns of agriculture activity, since it is easier to select
information and connect events in time.
As future work we intend to improve the analysis and the tools that were used and conduct
experiments with different entities (products and regions) and variables.
Furthermore, we are considering the use of sliding windows to help in the detection of events.
The idea is to understand how the granularity of the time window affects the results. To
investigate this we choose a given period of time, e.g. a quarter of the year, and compute the
Tucker3 tensor decomposition in this block of data. After, we select another block of data (e.g.
15 days ahead) and repeat the process. Once we have the results for each time window, it is
possible to make a movie linking the sequence of plots and analyze the significant changes or
events in a dynamic and intuitive way for the user.
Acknowlegments
Center of Advanced Studies on Applied Economics (CEPEA) – USP/Brazil
Laboratory of Artificial Intelligence and Decision Support (LIAAD) – UP/Portugal
526
EFITA/WCCA ’11
Agricultural Automation Laboratory (LAA) – USP/Brazil
References
Cepea. 2010. Center of Advanced Studies on Applied Economics. Available at:
http://www.cepea.esalq.usp.br.
Conab. 2010. National Food Supply Company. Available at: http://www.conab.gov.br.
Correa, F. E., Corrêa, P. L. P., Almeida Jr., J. R., Alves, L. R. A., Saraiva, A. M. 2009, Data warehouse for
soybean and corn market on Brazil. Joint International Agricultural Conference.
Hoffmann, R., Vieira, S. 1987. Análise de Regressão, Uma Introdução à Econometria - Regression
Analysis, Econometric Introduction, Ed. Hucitec, São Paulo.
IBGE. 2010. Brazilian Institute of Geography and Statistic. Available at: http://www.ibge.gov.br.
King, R. P., Boehlje, M., Cook, M., L., Sonka, S. T., 2010. Agribusiness Economics and Management,
American Journal of Agricultural Economics, Oxford Journal, 92: 554-570.
Kolda, T. G. and Bader, B. W., 2009. Tensor decompositions and applications. SIAM Review. vol. 51, n 3,
pp. 455-500
Skillicorn, D., 2007. Understanding Complex Datasets: Data Mining with Matrix Decompositions,
Chapman and Hall/CRC, New York - USA
Tucker, L. R., 1966. "Some mathematical notes on three-mode factor analysis", Psychometrika, vol.
31, pp. 279.
EFITA/WCCA ’11
527