Proc. of Int. Conf. on Advances in Computer Science 2010
©2010 ACEEE DOI: 02 ACS.2010.01.223

Neural Network Approach to Predict Quality of Data Warehouse Multidimensional Model

Anjana Gosain1, Sangeeta Sabharwal2, Sushama Nagpal3
1 School of Information Technology, GGSIPU, New Delhi-110006
2 Computer Engineering Division, Netaji Subhas Institute of Technology, Dwarka, New Delhi-110078
3 Computer Engineering Division, Netaji Subhas Institute of Technology, Dwarka, New Delhi-110078

Abstract - Estimating the quality of a data warehouse at an early stage is an important task. The quality of a data warehouse depends on the data quality and on the data warehouse multidimensional model. Very little work has been done to assess the quality of the multidimensional model objectively using metrics. Statistical techniques such as correlation analysis and univariate and multivariate regression have been used, and they indicate that these metrics are significantly related to the quality of multidimensional models for data warehouses. However, in the context of data warehouses, very little work has been done to predict multidimensional model quality using machine learning techniques. In this paper, we use an artificial neural network (ANN) to predict the quality of multidimensional schemas. We show that an ANN is able to successfully model the complex and non-linear relationships between the quality metrics and understandability, which is one of the quality factors of maintainability of a data warehouse multidimensional model.

I. INTRODUCTION

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making processes [7]. Due to the strategic importance of the data warehouse, it is crucial for an organization to guarantee its quality so that decision makers can make better decisions. The quality of the data warehouse multidimensional model has a great influence on overall data warehouse quality and hence, in turn, on information quality [2]. In the past, researchers have suggested interesting recommendations to achieve a good dimensional data model [7][8]. However, quality criteria are not enough on their own to ensure quality in practice. Metrics have been used to build prediction systems for database projects, to understand and improve software development and maintenance projects, and to maintain the quality of systems by highlighting problematic areas. In the context of data warehouses, however, few authors have proposed metrics for the data warehouse multidimensional model [2][3]. Some of these metrics have been theoretically and empirically validated using various techniques. Most of the empirical validation is done using statistical techniques such as correlation analysis and univariate and multivariate regression analysis, which are unable to model the non-linear relationship between the metrics and data warehouse multidimensional model quality.

In this paper, we propose a neural network approach to predict the quality of the multidimensional model for a data warehouse. It is observed that an ANN is able to successfully model the complex and non-linear relationships between the quality metrics proposed in [3] and the understandability of the data warehouse multidimensional model, which is one of the quality factors of maintainability.

The paper is organized as follows: Section II discusses the related work; Section III discusses the objective and motivation for this work; Section IV briefly describes the proposed metrics for data warehouse schemas. In Section V, the architecture of the proposed neural network is explained. Section VI discusses the experiment conducted and the analysis of the results. Section VII summarizes the work and presents the conclusion.

II. RELATED WORK

Very little work has been done to assess the quality of the multidimensional model objectively using metrics. Si-Said and Prat [10] proposed a few metrics for the analyzability and simplicity of multidimensional schemas. Nevertheless, none of these metrics has been empirically validated, and therefore they have not proven their practical utility [11]. Calero et al. proposed a few metrics to assess the quality of the data warehouse model objectively [2]. These metrics are theoretically validated using the Zuse framework for formal validation. Subsequently, the authors provided various empirical validations for some of these metrics [3][5][12][15][16]. Techniques such as correlation analysis, regression, case-based reasoning and formal concept analysis were used, and they indicated that the proposed metrics are significantly related to the quality of multidimensional models.

Artificial neural networks have played a significant role in predicting software quality accurately. Various researchers in the field of software quality have used neural networks to predict the quality of software [18]. Thwin and Quah used a Ward neural network and a General Regression neural network (GRNN) to predict the number of defects in a class using object-oriented design metrics [13]. Work has also been done to predict the fault proneness of classes on the basis of object-oriented metrics using neural networks [19]. The authors empirically validated their approach using a data set collected from software modules. The results were compared with two statistical models using five quality attributes, and neural networks were found to do better. Artificial neural networks have also been used to estimate the maintainability of software and the maintenance effort for software.

III. OBJECTIVE AND MOTIVATION

Various mathematical and machine learning or artificial intelligence (AI) based techniques, such as regression analysis, artificial neural networks (ANN), genetic algorithms (GA), fuzzy logic (FL) and case-based reasoning, are being used for accurate prediction and estimation of quality. Artificial intelligence combines the elements of learning, adaptation and evolution; e.g., NNs are able to learn from experimental data, represent highly non-linear and multivariate relationships, and are expertise or rule based. These techniques have been successfully applied to predict the quality of software, and it is observed that neural-network-based systems are able to approximate non-linear functions with more precision. To date, however, there is no work applying neural networks to predict the quality of multidimensional models. This work therefore uses a neural network approach to build a model that predicts the quality of the multidimensional model for a data warehouse.

IV. METRICS FOR MULTIDIMENSIONAL MODEL UNDERSTANDABILITY

This section briefly discusses the metrics proposed by Serrano et al. [3] for data warehouse multidimensional models. The proposed metrics are as follows:

NFT(Sc): Number of fact tables of the schema.
NDT(Sc): Number of dimension tables of the schema.
NFK(Sc): Number of foreign keys in all the fact tables of the schema.
NMFT(Sc): Number of measurable facts in the schema.

To understand these metrics, let us consider the star schema given in Figure 1.

[Figure 1: Star schema. Fact table: Sales Fact (Prod_id (FK), Company_id (FK), Location_id (FK), Time_id (FK), Sales in dollars, Quantity). Dimension tables: Company (Company_id, Company Name, Region Name, Branch); Product (Prod_id, Prod_cat, Prod_name, Prod_feature, Unit price); Location (Location_id, Country name, State name, City name); Time (Time_id, Week, Month, Year).]

The metrics for the star schema given in Figure 1 are calculated in Table I.

TABLE I: CALCULATED METRICS FOR STAR SCHEMA IN FIGURE 1

Metric    Value
NFT       1
NDT       4
NFK       4
NMFT      2

V. PROPOSED NEURAL NETWORK MODEL

This section briefly describes the architecture of the proposed neural network and the criteria on which its performance is evaluated.

A. Architecture/Learning algorithm

The general architecture of the NN model, shown in Fig. 2, is described in this section. The model can be viewed as a directed graph composed of nodes and connections between nodes (weights W11, W12, … and biases B1, B2, …). A set of training vectors is presented to the neural network one at a time. The input nodes are connected to every node of the hidden layer but are not directly connected to the output node. The neural network considered in this work is a multilayer feed-forward network with four inputs, 15 hidden neurons and one output neuron. The maximum number of epochs is set to 1000. The transfer function used is tansig. The training algorithm used is Trainbr (Bayesian regularization). Trainbr is a network training function that updates the weight and bias values according to Levenberg-Marquardt optimization. It minimizes a combination of squared errors and weights, and then determines the correct combination so as to produce a network that generalizes well. The network is trained using 60% of the input data set, and the rest is used for testing.

[Fig. 2: Neural network architecture (4-15-1), showing the input layer, hidden layer and output layer.]

B. Performance criteria

The performance of a trained neural network can be measured on the basis of various parameters, e.g. sum of squared errors (SSE), mean square error (MSE), mean absolute relative error (MARE), mean absolute percentage error (MAPE) and root mean square error (RMSE). In this paper, MARE is used to analyse the performance of the network. The experimental results are reported in Table II in terms of MARE (training and testing) and the correlation coefficient R.
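As an illustration of the network and criteria described in this section, the following minimal Python sketch runs a forward pass through a 4-15-1 feed-forward network with a tansig hidden layer and computes MARE and the correlation coefficient R. It is not the Matlab NN toolbox code: the weights are random and untrained (no Trainbr step is implemented), the linear output neuron is an assumption because only tansig is named as the transfer function, and all function names are hypothetical.

```python
import math
import random


def tansig(x):
    # Hyperbolic tangent sigmoid; mathematically the same function as tanh.
    return math.tanh(x)


def init_network(n_in=4, n_hidden=15, seed=0):
    # Random, untrained weights and biases for an n_in - n_hidden - 1 network.
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = rng.uniform(-0.5, 0.5)
    return w1, b1, w2, b2


def forward(net, x):
    # Inputs feed every hidden neuron; only hidden neurons feed the output.
    w1, b1, w2, b2 = net
    hidden = [
        tansig(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    # Assumption: linear output neuron.
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def mare(actual, predicted):
    # Mean absolute relative error; actual values must be non-zero.
    return sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)


def pearson_r(actual, predicted):
    # Pearson correlation coefficient between actual and predicted values.
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


if __name__ == "__main__":
    net = init_network()
    # Metric vector (NFT, NDT, NFK, NMFT) of the star schema in Table I.
    print(forward(net, [1, 4, 4, 2]))
    # Made-up values, only to exercise the two measures.
    print(mare([1.0, 2.0, 4.0], [1.1, 1.8, 4.4]))
    print(pearson_r([1.0, 2.0, 3.0], [2.1, 3.9, 6.2]))
```

Because the weights are untrained, the network output here carries no meaning; the sketch only shows how the four metrics enter the network and how the two reported measures are evaluated.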
MARE is computed as

$$\mathrm{MARE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

where $y_i$ is the ith value of the actual output, $\hat{y}_i$ is the estimated output, and $N$ is the number of observations. In this work, we have used the Matlab NN toolbox functions.

VI. EXPERIMENT

In this work, the effect of the proposed metrics on the understandability of data warehouse schemas is demonstrated. Hence, the output variable is understandability. Understandability is also a sub-characteristic of maintainability (ISO2001). Boehm defined understandability as a characteristic of software quality that means ease of understanding software systems [17].

A. Data collection

We applied the ANN technique to the data set available in Serrano et al. [3], in which the authors collected data for the metrics and conducted an experiment to measure the time required to understand multidimensional schemas. Schemas with a higher average understanding time are more difficult to understand (less understandable).

B. Analysis of results

The results of training the neural network using the Trainbr algorithm are shown in Fig. 3, which shows that the sum of squared errors for training is approximately 0.1668.

TABLE II: EXPERIMENTAL RESULTS (PERFORMANCE OF NEURAL NETWORK IN TERMS OF MARE AND R)

                     MARE      R
For training data    0.0726    0.975
For testing data     0.0813    0.94

The regression plot showing the target and actual values as predicted by the NN is shown in Fig. 4.

[Fig. 4: Regression plot showing the target and actual values as predicted by NN]

The results for the network architecture (4-15-1) indicate that the neural network can predict the quality of the data warehouse multidimensional model with acceptable accuracy, as is evident from the high value of the correlation coefficient R (more than 0.9) and the low value of MARE for both training and testing data.

VII. CONCLUSION

Empirical validations play an important role in building cumulative knowledge and extracting useful measurement conclusions to be applied in practice.
In this work, the NN technique is used to estimate the understandability of the data warehouse multidimensional model. The NN proposed in this work is able to predict targets that closely match the measured experimental values. However, we feel that more studies of this type must be carried out with larger data sets to generalize and validate the results.

[Fig. 3: Results of training function]

REFERENCES

[1] Jeusfeld M. A., Quix C., Jarke M. (1998), "Design and analysis of quality information for data warehouses", Intl Conference on Conceptual Modelling, ER 1998.
[2] Calero C., Piattini M., Pascual C., Serrano M. A., "Towards Data warehouse quality metrics", 3rd International Workshop on Design and Management of Data Warehouses (DMDW 2001), Interlaken, Switzerland.
[3] Serrano M. A., Calero C., Sahraoui H. A., "Empirical studies to assess the understandability of data warehouse schemas using structural metrics", Software Quality Journal 16(1): 79-106, 2008.
[4] Bouzeghoub M., Fabret F., Galhardas H., "Data warehouse refreshment, Chapter 4", Fundamentals of Data Warehouses.
[5] Serrano M., Trujillo J., Calero C., Piattini M., "Metrics for data warehouse conceptual models understandability", Information and Software Technology (INFSOF), Elsevier, Vol. 49(8), pp. 851-870, 2007.
[6] "Software product evaluation-quality characteristics and guidelines for their use", ISO/IEC standard 9126.
[7] Inmon W. H. (1997), "Building Data warehouse", John Wiley & Sons.
[8] Kimball R., Reeves L., Ross M., Thornthwaite W. (1998), "The data warehouse lifecycle toolkit", John Wiley & Sons.
[9] Moody Daniel L. (2005), "Theoretical and practical issues in evaluating the quality of conceptual models: current state and future directions", Data and Knowledge Engineering, Volume 55, Issue 3, pp. 243-276.
[10] Said Si, Prat N. (2003), "Multidimensional Schemas Quality: Assessing and Balancing Analyzability and Simplicity", ER Workshops 2003, Springer LNCS 2814, pp. 140-151.
[11] Fenton N., Pfleeger S. (1996), "Software Metrics: A rigorous Approach", Chapman & Hall, London.
[12] Serrano M. A., Calero C., Piattini M., "Empirical Validation of Multidimensional Data model metrics", IEEE Hawaii International Conference on System Sciences, 2002.
[13] Mie Thet Thwin, Tong-seng Quah, "Application of neural networks for software quality prediction using object-oriented metrics", Journal of Systems and Software, Volume 76, Issue 2 (May 2005), pp. 147-156.
[14] Calero C., Piattini M., Genero M., "Metrics for controlling database complexity", Chapter III in Developing quality complex database systems: Practices, Techniques and Technologies, Becker (ed.), Idea Group Publishing, 2001.
[15] Serrano M. A., Calero C., Trujillo J., Lujan S., Piattini M., "Empirical Validation of Metrics for Conceptual Models of Data warehouse", Lecture Notes in Computer Science 3084 (CAiSE 2004), pp. 506-520.
[16] Serrano M., Calero C., Piattini M., "Validating metrics for data warehouse", IEE Proceedings SOFTWARE in association with BCS, Vol. 149, No. 5, pp. 161-166, 2002.
[17] Boehm B. W. (1978), "Characteristic of software quality", North Holland.
[18] Khoshgoftaar T., Allen E. D., Hudepohl J. P., Aud S. J., "Application of neural networks to software quality modeling of a very large telecommunications system", IEEE Transactions on Neural Networks, 8(4), pp. 902-909, 1997.
[19] Kanmani S., Rhymend Uthariaraj, Sankaranarayana, Thambidurai P., "Object oriented software fault prediction", Information and Software Technology, Volume 49(5), pp. 483-492, 2007.
[20] Jarke Matthias, YV, "Data Warehouse Quality: A review of the DWQ Project", in Conference of Information Quality, Massachusetts Institute of Technology, Cambridge.