Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining Applied on Grain Data Mart F. E. Correa1a, M. D. B. Oliveira2a, L. R. A. Alves3, J. Gama2b and P. L. P. Corrêa1b 1 Agricultural Automation Laboratory - Department of Computer Engineering - Polytechnic School – University of São Paulo, São Paulo, SP. POBox: 61548. ZIP 05424-970 - Brazil. 2 Laboratory of Artificial Intelligence and Decision Support – Inesc Porto LA, University of Porto, R. de Ceuta, 118, 6., ZIP: 4050-190, Porto, Portugal. 3 Center for Advanced Studies on Applied Economics, School of Agriculture “Luiz de Queiroz” – University of São Paulo, Piracicaba, SP. ZIP: 13400-970 - Brazil. 1a e-mail: [email protected] 1b e-mail: [email protected] 1a e-mail: [email protected] 1b e-mail: [email protected] 3 e-mail: [email protected] Abstract Agribusiness, as many other activities, produces huge amounts of spatio-temporal data. We need a system in order to store, analyze, and mine this data. In a previous work, we developed data warehouse tools to store, organize and query Brazilian agribusiness data from several regions along 10 years. In this paper, we go a step ahead, and propose specific data mining techniques to discover marks and evolution patterns from Agribusiness data. We propose the use of Tucker decomposition to automatically detect short time windows that exhibit large changes in the correlation structure between the time-series of prices from the Brazil Grain market. Keywords: Data mining, Tucker Decomposition, Correlation, Grain Market Introduction Agribusiness activity has the greatest influence in Brazil’s Economy and knowledge is necessary for an efficient administration to maintain competitiveness. Therefore, it is necessary to increase the number of researches and develop technology and innovation to extract information from Agribusiness data, which is more complex to analyze due to its specificities and large number of variables (or attributes). As a concern we can say that each segment in agribusiness has sufficient variables and data to make the work hard (King, 2009). Universities, research centers and some of Brazil’s governmental divisions have collected and maintained agribusiness data information for 20 years. This data can be used to understand the evolution of agribusiness sectors. These types of data have a lot of variables regarding time, because the databases are increasing so fast (IBGE, 2010). A few years ago, the problem was related to the lack of data to analyze. In contrast, nowadays the high volume of information turns its analysis into a time-consuming and cumbersome task. An example of this are the databases from Center of Advanced Studies on Applied 518 EFITA/WCCA ’11 Economics – Cepea which started to collect and maintain data about agri-economics of Brazil (Cepea, 2010). Following this line, Correa (2009) created an environment using Cepea databases and made possible to create models and tools for developed Data Marts toward Brazil Agribusiness. This environment also provides an easier access to the information by the specialists, due to the way data is organized. However, it is still difficult to select which part of the data can explain some events, or find patterns in a larger database, without resorting to automatic tools. Moreover, the information that is stored is time-series data, and agriculture has seasonal plantation crops or harvests, whose fluctuations are reflected on the collected data (e.g., prices variability within a year). Another aspect that could be considered is the introduction of geographic information related to the position of the agriculture’s regions. This is not only pertinent in the discovery of relevant data, such as costs of transportation, but also in the understanding of the influence of several variables, such as the distances between the regions, in the product prices negotiated on them. Bearing this in mind, we started a research using Data Mining techniques to explore and discover events in time-series and geographic data about agribusiness. This paper aims to show some preliminary results and compare them with facts that happened in the past. Our hypothesis is that the analysis of Tucker decompositions and correlations can help us in the automatic detection of events on the real world data. The Data Mart used is Brazil’s agri-economics, more specifically, the Grain Data Mart. Concepts and methods In this chapter some information about Grain Data Mart, Tucker decomposition, correlation and the adopted process are described. Grain Data Mart: The amount of data used in this work was obtained using OLAP tools. It is important to consider in a Data Mining process the use of data from a Data Warehouse/Data Mart in order to avoid problems, such as duplications or inconsistent data (Correa, 2009). For a better comprehension a short description of Data Mart variables is presented: • Products: not processed soybean and corn (commodities), negotiated per bag of 60 kilos. • Prices: are the average prices, expressed in Brazilian currency – real -, per bag of 60 kilos (R$-Sc/60 kg), practiced in negotiation (buy and sell) of market players for products into region reference, in producer and wholesaler (between companies) levels. • Producers’ prices: commercialization prices to producers, for products stored in warehouses of the regions where they were produced. • Wholesale prices: commercialization prices of the final product for the negotiation that is established between companies; these prices are clean, dry and standardized (export standard), and do not include Brazilian taxes. EFITA/WCCA ’11 519 • Region: group of cities where the product is produced and negotiated; a standard price is considered, which means it is the same for different cities. The Grain Data Mart contains 7 products, 15 regions, 5 measures and has more than 150 thousand records. The granularity of the data is divided in a hierarchy of regions (e.g. cities, regions and state) and time (e.g. day, month, quarter, semester and year). In this paper we only mine part of the Data Mart: we analyze the regions that have prices available for soybean and corn (7 regions) and we select the producers’ prices and wholesale prices, for each grain, as our measures (4 measures). In Figure there are depicted the studied regions, which are: Triângulo Mineiro (MG), Rio Verde (GO), Mogiana (SP), Sorocabana (SP), Norte do Paraná (PR), Passo Fundo (RS) and Sorriso (MT). The data is analyzed for a period that goes from Jan/2006 to Dec/2009. The chosen granularities are day, month and quarter. Figure 1. Geographical location of the Regions analyzed in this paper. Tucker Decomposition Tucker decomposition is a data analysis tool that is quite useful for data cleaning, data compression and visualization of the main structure of data in low-dimensional spaces. Tucker (Tucker, 1966) devised this method in order to extend the well-known PCA (Principal Component Analysis) to higher-order data representations, such as tensors. We can straightforward define a tensor as an extension of a matrix to three or more dimensions, or as a !-way data array, where ! is the order of the tensor. The Data Mart analyzed in this paper 520 EFITA/WCCA ’11 can be easily organized as a three-order tensor, since it incorporates the time dimension. The order, ways or modes of a tensor are synonyms and refer to the number of dimensions (in our case, we have three dimensions: regions, measures and time). For this specific type of tensors or, in other words, !-way data arrays for ! ! !, the most appropriate Tucker decomposition model is the so-called Tucker3 tensor decomposition (Tucker, 1966), which performs the reduction of data in all three modes of the tensor. The basic idea of the Tucker3 decomposition is to find a set of matrices and a small tensor that, in general, have less dimensionality than the original tensor, but are able to reconstruct the most important information contained in data. The Tucker3 model can be formulated as the factorization of the original three-order tensor !, such that ! ! ! !!"# ! !!"# !!" !!" !!" !!! !!! !!! for ! ! !! ! ! !, ! ! !! ! ! ! and ! ! !! ! ! !. Here, the coefficients !!" , !!" and !!" represent the entries of the component matrices ! ! ! !!!! , ! ! ! ! !!! and ! ! ! !!!! . In turn, the coefficient !!"# represents the entry of the so-called core tensor ! ! ! !!!!!! . The number of entities (i.e, the number of rows) in each mode are represented by letters !! ! and !. The number of components (i.e., the number of columns of the matrices !, ! and !) in the first, second and third mode of the tensor are represented by letters !! ! and !, respectively. The basic decomposition of three-order tensor, using the Tucker technique, is depicted in Figure 1. Figure 2. The basic Tucker3 tensor decomposition (Kolda and Bade, 2009) Tucker suggested interpreting the core tensor as describing the latent structure in data and the component matrices as mixing this structure to give the observed data. The core tensor can also be interpreted as a generalization of the eigenvalues of the SVD (Singular Value Decomposition), and it constitutes a further partitioning of the explained variation as is indicated by the eigenvalues of the standard PCA. Detailed information about Tucker3 technique can be found Tucker (1966), Skillicorn (2007) and Kolda et al. (2009). We resort to this technique in order to detect abnormal events and important marks in agribusiness data, for the considered time horizon. EFITA/WCCA ’11 521 Correlation Correlation is a measure of the relationship between pairs of entities and its value indicate the strength and direction of its linear relationship. Typically, this dependence between entities is measured using the Pearson’s correlation coefficient, which is obtained dividing the covariance of the two entities (usually, variables) by the product of their standard deviations and is denoted by !. When the values of both entities vary in the same direction (e.g., if ! increases, then ! increases, and vice versa), we say that entities ! and ! are positively correlated (! ! !). The strength of this relationship is defined by the magnitude of !, which reaches its maximum when ! ! ! – perfect positive correlation. On the other hand, when the values of both entities vary in opposite directions (for instance, when ! increases, then ! decreases), the entities are negatively correlated (! ! !). Similarly, the correlation coefficient attains its minimum at !!, which indicates a perfect negative correlation. It should be noted that a zero value of the Pearson’s correlation coefficient (! ! !) does not imply an absence of relationship between both entities. This outcome can only indicate the inexistence of a linear relationship between the analyzed entities (Hoffmann et al., 1977). In this paper, we will use the analysis of correlation as a complementary technique of Tucker3 tensor decomposition results. The reason behind this is the fact that Tucker3 only allow the detection of irregularities in data, and does not provide deeper information about the causes of abnormal behavior, which can be explored in practice computing correlations. Application process As mentioned before, the chosen Data Mart is composed by 7 regions, 4 measures (averages prices of soybean and corn) and a total of 4 years. Time was split in blocks, more specifically, in 16 blocks (every quarter) and 4 blocks (each year), in order to apply Tucker3 techniques. To conduct the experiments described in this paper we resort to the N-way toolbox for MatLab. We developed a batch to automatic save the results and the plots returned by the functions and we got 20 results for analysis. In order to apply the Tucker3 decomposition first was necessary to define the modes that would be used. We choose the regions, measures and time as the modes of a three-order tensor. The outputs are presented in plots defined by the previously mentioned modes. To help the interpretation of the results we compute the correlations for every quarter and for every combination of pairs of regions, fixing a given measure and a given product. For instance, we compute the correlation between the pair of regions !"##$%"!!"#!!"#$", regarding the price of the corn; then we compute again the correlation between another pair of regions (e.g !"##$%"!!"#$%!!"#"$!) for the same measure, and so on. 522 EFITA/WCCA ’11 Results In order to help the decision of which plot to present, we compared the results returned for the analysis using the quarter time blocks and the year time blocks. We concluded that both analyses obtained very close results. Therefore, in this study we only present plots regarding the years (results returned by applying Tucker3 decomposition to annual data). To understand and interpret the results the first step is to analyze the variation explained by each combination of components (components are the columns of each component matrix) that are shown in Table 1. This table comprises information regarding the year and the combination of components (or factors) that explain the most variation of the original tensor. Table 1. Combination of components with higher explanatory power, for each analyzed year Each number of a specific combination of the column Factor refers to the number of the component (or column) of each component matrix, which, in turn, has compressed information about each mode of the original tensor.(regions, measures and time). To give an example: for the time block of year 2008, the combination (1,1,1), ie, the first component of mode regions, the first component of mode measures and the first component of mode time, explains 55.88% of the variation contained in the original data mart. The combinations of components with higher percentage are the most relevant ones, so the information provided by the Table give us an idea where we should focus our analysis. Another output of the Tucker3 technique is the component matrices and the core tensor. The number of rows of each component matrix corresponds to the number of regions, and the number of columns represent the number of components (for the sake of interpretation, we can make an analogy to the principal components returned by PCA), that can vary for each mode. Each entry of the matrix has associated a score, which value is within the interval !!!!!!. In order to find interesting events, we look for regions with scores closer to extreme values (-1 or 1). Scores closer to zero are less interesting since they represent stable data. We graphically represent the component matrices in a way that we can easily detect extreme scores. In X-axis are represented the ID of each object in a mode (e.g., in mode regions, the Xaxis corresponds to the ID of regions) and in Y-axis the scores are represented. In Figure the contents of the component matrix of mode regions are illustrated. Each subplot corresponds to a one of the four blocks of time (one per year from 2006 to 2009). The labels of the regions in the X-axis are: 1 - Triangulo Mineiro (MG); 2 – Rio Verde (GO); 3 – Mogiana (SP); 4 – Norte do Paraná (PR); 5 – Passo Fundo (RS); 6 – Sorocabana (SP), and 7 – Sorriso (MT). EFITA/WCCA ’11 523 (a) Regions plot. (b) Years (time) plot. Figure 3. Component Matrices plot. The plots are highlighted with a circle to draw attention for extremes scores that identify the interesting events. Proceeding the analysis for the 1st factor (blue spots), we detect the following events: in year 2006 Sorriso region is associated with a high negative score (score:-0.8); in 2007 this region presents a score of -0.6, which is not as high as the previous one but still can be considered an extreme score. Regarding the mode measures, the results follow the same tendency of Sorriso Region for all analyzed years. It is noteworthy that, during years 2008 and 2009, the scores for Sorriso region invert (from negative to positive) and the values became close to 1. On the other hand, the remaining six regions have associated opposite scores or values very close to 0. Based in this informatio it can be inferred that Sorriso region has opposite behavior compared to the behavior of all other regions. Similarly to Figure -a, Figure -b illustrates the scores for each component of mode time (represented in blocks of years). The analysis procedure is the same as the one adopted in the previous mode. It is possible to see that, on October and December of 2007, and for the 1st component (blue line), the scores of time present a descending trend (the scores became even more lower through time). Regarding the 3rd component (red line), this one is characterized by a great variability along different months. However, since this component has a low representativeness of the original data (1.78%), as can be verified in Table 1, won’t be considered for further analysis. In contrast with 2007, and for the same months described before, the 1st component (blue line) of year 2008 presents a significant change to positive scores and the 2nd component (green line) of year 2008 a remarkable change to negative scores. 524 EFITA/WCCA ’11 During 2009 the 1st component is maintained stable (scores close to 0), so we do not consider it interesting within the purpose of our analysis. However, the 2nd component (green line) presents positive scores for May/June and negative scores in the end-of-the-year (November and December). As a complement of the Extreme Scores Analysis of Tucker3 tensor decomposition, we compute the correlations using the procedure explained in the subsection Application Process. These will be computed for each quarter of years 2007 and 2009, which are the years that present special behavior according to the previous findings. We focus on quarters so we can have access to more granular information. Since Tucker3 plots for mode regions highlighted Sorriso, this region was fixed for the computation of correlations (combinations of region Sorriso with all other regions) for the corn product prices. The resulting plots are depicted in Figure -a. the graphical analysis allows us to draw some interesting facts, namely:, the fact that Sorriso region has negative correlation on the first semester of 2007 and positive correlation on the last semester of this year. The importance of this event is to infer that, on the first semester, almost every regions had an opposite trend comparing with the Sorriso region. This means that, for instance, when prices of Sorocabana region (6) increase the prices of Sorriso region (7) decrease or vice versa. However, during the second semester the Sorriso region presents the same behavior as other regions or, in other words, if the prices of one of the regions decrease, then the prices of Sorriso region also decrease (the opposite is also true). (a) Correlation for 2007. (b) Correlation for 2009. Figure 4. Correlation between Sorriso region and all other regions, with respect to Corn prices For the plot of mode time (years), returned by Tucker3 decomposition and presented in Figure -b, May and June of year 2009 were highlighted. Performing the same tests based on correlations, that were described earlier, and presented in Figure we conclude that only in this quarter the correlation is negative. The information provided by both analysis – Extreme Scores Analysis based on Tucker3 tensor decomposition and Correlation analysis - unveiled interesting facts about Brazilian Agribusiness. First of all, it is known by the domain experts, that these two products - Soybean and Corn - compete for crop fields in Brazil. To have an idea, the total grain production for EFITA/WCCA ’11 525 Mato Grosso State and for crop year 2007/2008, was more than 25 million ton. (17 million for Soybean and 8 million ton for Corn). Sorriso Region is the biggest producer of Soybean (the production for year 2009 was R$ 1.33 billion, or U.S. $775 million dollars), (IBGE, 2010). The second interesting fact detected by the performed analysis, is related to the Corn production from 2006 until the first semester of 2007, which was unexpressive. Because of that the prices didn’t follow the variations of the other regions showed in this paper. However, during the second semester of 2007 Mato Grosso State (that include Sorriso region) had an increase of Corn production, for 4 million ton in 2005/2006 to 7.8 million ton on last semester in 2007/2008, which turned this State to an important producer of Corn and, consequently, as a dictator of price trends to other regions (Conab, 2010). The third important event that was captured using these techniques is related to the fact that, during the first semester of 2009, the Corn prices were unstable; nevertheless, the prices in Sorriso region became more stable. This is due to the fact that, in general, the volume of crop fields and their prices are more stable on the second quarter, in Sorriso region, than in the other regions, for the same period. Conclusions In this paper we propose the use of two complementary techniques – Tucker3 tensor decomposition and Correlation Analysis – as means of gaining insight about abnormal and important events related to the Brazilian Agribusiness. The proposed process to analyze large amounts of data that arise from Agriculture activities, allowed us to draw some interesting conclusions that, otherwise, would be hard to find. Through the detection of irregularities in data, these techniques can help the specialists focus and concentrate efforts in specific events occurred in the past. These are, in general, the representative events that Tucker3 technique highlights. Therefore, the specialists free more of their time in the understanding and characterization of the evolution patterns of agriculture activity, since it is easier to select information and connect events in time. As future work we intend to improve the analysis and the tools that were used and conduct experiments with different entities (products and regions) and variables. Furthermore, we are considering the use of sliding windows to help in the detection of events. The idea is to understand how the granularity of the time window affects the results. To investigate this we choose a given period of time, e.g. a quarter of the year, and compute the Tucker3 tensor decomposition in this block of data. After, we select another block of data (e.g. 15 days ahead) and repeat the process. Once we have the results for each time window, it is possible to make a movie linking the sequence of plots and analyze the significant changes or events in a dynamic and intuitive way for the user. Acknowlegments Center of Advanced Studies on Applied Economics (CEPEA) – USP/Brazil Laboratory of Artificial Intelligence and Decision Support (LIAAD) – UP/Portugal 526 EFITA/WCCA ’11 Agricultural Automation Laboratory (LAA) – USP/Brazil References Cepea. 2010. Center of Advanced Studies on Applied Economics. Available at: http://www.cepea.esalq.usp.br. Conab. 2010. National Food Supply Company. Available at: http://www.conab.gov.br. Correa, F. E., Corrêa, P. L. P., Almeida Jr., J. R., Alves, L. R. A., Saraiva, A. M. 2009, Data warehouse for soybean and corn market on Brazil. Joint International Agricultural Conference. Hoffmann, R., Vieira, S. 1987. Análise de Regressão, Uma Introdução à Econometria - Regression Analysis, Econometric Introduction, Ed. Hucitec, São Paulo. IBGE. 2010. Brazilian Institute of Geography and Statistic. Available at: http://www.ibge.gov.br. King, R. P., Boehlje, M., Cook, M., L., Sonka, S. T., 2010. Agribusiness Economics and Management, American Journal of Agricultural Economics, Oxford Journal, 92: 554-570. Kolda, T. G. and Bader, B. W., 2009. Tensor decompositions and applications. SIAM Review. vol. 51, n 3, pp. 455-500 Skillicorn, D., 2007. Understanding Complex Datasets: Data Mining with Matrix Decompositions, Chapman and Hall/CRC, New York - USA Tucker, L. R., 1966. "Some mathematical notes on three-mode factor analysis", Psychometrika, vol. 31, pp. 279. EFITA/WCCA ’11 527