Proc. of Int. Conf. on Advances in Computer Science 2010
Neural Network Approach to Predict Quality of
Data Warehouse Multidimensional Model
Anjana Gosain (1), Sangeeta Sabharwal (2), Sushama Nagpal (3)
(1) School of Information Technology, GGSIPU, New Delhi-110006
(2) Computer Engineering Division, Netaji Subhas Institute of Technology, Dwarka, New Delhi-110078
(3) Computer Engineering Division, Netaji Subhas Institute of Technology, Dwarka, New Delhi-110078
Abstract - Estimating the quality of a data warehouse at an early
stage is an important task. The quality of a data warehouse depends
on the quality of its data and of its multidimensional model. Very
little work has been done to assess the quality of the
multidimensional model objectively using metrics. Statistical
techniques such as correlation analysis and univariate and
multivariate regression have been used, and they indicate that these
metrics are significantly related to the quality of multidimensional
models for data warehouses. However, in the context of data
warehouses, very little work has been done to predict
multidimensional model quality using machine learning techniques. In
this paper, we use an artificial neural network (ANN) to predict the
quality of multidimensional schemas. We show that an ANN is able to
successfully model the complex and non-linear relationships between
the quality metrics and understandability, which is one of the
quality factors of maintainability of a data warehouse
multidimensional model.
I. INTRODUCTION
A data warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's
decision-making processes [7]. Because of the strategic importance
of the data warehouse, it is crucial for an organization to
guarantee its quality so that decision makers can make better
decisions. The quality of the data warehouse multidimensional model
has a great influence on the overall data warehouse quality and
hence, in turn, on information quality [2]. In the past, researchers
have suggested interesting recommendations for achieving a good
dimensional data model [7][8]. However, quality criteria are not
enough on their own to ensure quality in practice. Metrics have been
used to build prediction systems for database projects, to
understand and improve software development and maintenance
projects, and to maintain the quality of systems by highlighting
problematic areas. In the context of data warehouses, however, only
a few authors have proposed metrics for the multidimensional model
[2][3]. A few of these metrics have been theoretically and
empirically validated using various techniques. Most of the
empirical validation has been done using statistical techniques such
as correlation analysis and univariate and multivariate regression
analysis, which are unable to model the non-linear relationship
between the metrics and data warehouse multidimensional model
quality. In this paper, we propose a neural network approach to
predict the quality of the multidimensional model of a data
warehouse.
We observe that an ANN is able to successfully model the complex and
non-linear relationships between the quality metrics proposed in [3]
and the understandability of the data warehouse multidimensional
model, which is one of the quality factors of maintainability.
The paper is organized as follows: Section II discusses related
work; Section III presents the objective and motivation of this
work; Section IV briefly describes the proposed metrics for data
warehouse schemas; Section V explains the architecture of the
proposed neural network; Section VI discusses the experiment
conducted and analyzes the results; Section VII summarizes the work
and presents the conclusion.
II. RELATED WORK
Very little work has been done to assess the quality of the
multidimensional model objectively using metrics. Si-Said and Prat
[10] proposed a few metrics for the analyzability and simplicity of
multidimensional schemas. Nevertheless, none of these metrics has
been empirically validated, and therefore their practical utility
has not been proven [11]. Calero et al. proposed a few metrics to
assess the quality of the data warehouse model objectively [2].
These metrics were theoretically validated using the Zuse framework
for formal validation. Subsequently, the authors provided various
empirical validations for some of these metrics [3][5][12][15][16].
Techniques such as correlation analysis, regression, case-based
reasoning and formal concept analysis were used, and they indicated
that the proposed metrics are significantly related to the quality
of multidimensional models.
Artificial neural networks have played a significant role in
predicting software quality accurately. Various researchers in the
field of software quality have used neural networks to predict the
quality of software [18]. Thwin and Quah used a Ward neural network
and a General Regression neural network (GRNN) to predict the number
of defects in a class using object-oriented design metrics [13].
Work has also been done to predict the fault proneness of classes on
the basis of object-oriented metrics using neural networks [19]. The
authors validated their approach empirically using a data set
collected from software modules. The results were compared with two
statistical models using five quality attributes, and the neural
networks were found to perform better. Artificial neural networks
have also been used to estimate the maintainability of software and
the maintenance effort for software.
©2010 ACEEE
DOI: 02 ACS.2010.01.223
The metrics for the star schema given in Fig. 1 are calculated in
Table I.
TABLE I: CALCULATED METRICS FOR STAR SCHEMA IN FIGURE 1
Metrics   NFT   NDT   NFK   NMFT
Value     1     4     4     2
III. OBJECTIVE AND MOTIVATION
Various mathematical and machine learning or artificial intelligence
(AI) based techniques, such as regression analysis, artificial
neural networks (ANN), genetic algorithms (GA), fuzzy logic (FL) and
case-based reasoning, are being used for accurate prediction and
estimation of quality. Artificial intelligence combines the elements
of learning, adaptation and evolution; e.g., NNs are able to learn
from experimental data, represent highly non-linear and multivariate
relationships, and are expertise or rule based. These techniques
have been successfully applied to predict the quality of software,
and it has been observed that neural-network-based systems are able
to approximate non-linear functions with more precision. To date,
however, there is no work on applying neural networks to predict the
quality of multidimensional models. In this work, we therefore use a
neural network approach to build a model that predicts the quality
of the multidimensional model of a data warehouse.
V. PROPOSED NEURAL NETWORK MODEL
This section briefly describes the architecture of the proposed
neural network and the criteria on which its performance is
evaluated.
A. Architecture/Learning algorithm
The general architecture of the NN model, shown in Fig. 2, is
described in this section. The model can be viewed as a directed
graph composed of nodes and connections between nodes, with weights
W11, W12, … and biases B1, B2, …. A set of training vectors is
presented to the neural network one at a time. The input nodes are
connected to every node of the hidden layer but are not directly
connected to the output node.
The neural network considered in this work is a multilayer
feed-forward network with four inputs, 15 hidden neurons and one
output neuron. The maximum number of epochs is set to 1000. The
transfer function used is tansig (hyperbolic tangent sigmoid). The
training algorithm used is trainbr (Bayesian regularization), a
network training function that updates the weight and bias values
according to Levenberg-Marquardt optimization. It minimizes a
combination of squared errors and weights, and then determines the
correct combination so as to produce a network that generalizes
well.
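The 4-15-1 architecture and the regularized objective described above can be sketched in Python with NumPy. This is only an illustration and not the MATLAB toolbox code used in the paper: the function names, the random initialisation, the linear output neuron and the fixed values of `alpha` and `beta` are all assumptions (Bayesian regularization adapts the two weighting factors during training, which this sketch does not do).

```python
import numpy as np

N_INPUTS, N_HIDDEN, N_OUTPUTS = 4, 15, 1  # the 4-15-1 layout from the text


def init_params(seed=0):
    """Small random weights; the initialisation scheme is an assumption."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.5, size=(N_HIDDEN, N_INPUTS)),
        "b1": np.zeros(N_HIDDEN),
        "W2": rng.normal(scale=0.5, size=(N_OUTPUTS, N_HIDDEN)),
        "b2": np.zeros(N_OUTPUTS),
    }


def forward(params, x):
    """Map a batch of metric vectors, shape (n, 4), to predictions, shape (n,)."""
    hidden = np.tanh(x @ params["W1"].T + params["b1"])  # tansig is tanh
    output = hidden @ params["W2"].T + params["b2"]      # linear output (assumed)
    return output[:, 0]


def regularized_objective(params, x, y, alpha=0.01, beta=1.0):
    """beta * (sum of squared errors) + alpha * (sum of squared weights and biases)."""
    sse = np.sum((y - forward(params, x)) ** 2)
    ssw = sum(np.sum(p ** 2) for p in params.values())
    return beta * sse + alpha * ssw
```

Each input row holds the four metric values (NFT, NDT, NFK, NMFT) of one schema. A trainer would adjust the parameters to reduce `regularized_objective`; the penalty on the squared weights is what discourages overly large weights and helps generalization.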
The network was trained using 60% of the input data set, and the
rest was used for testing.
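A 60/40 split of this kind can be sketched as follows. The paper does not say how the training samples were selected, so the random shuffling, the seed and the data set size of 20 are assumptions made only for illustration; `split_60_40` is a hypothetical helper name.

```python
import numpy as np


def split_60_40(n_samples, train_fraction=0.6, seed=0):
    """Return shuffled index arrays for the training and testing subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]


# Hypothetical data set size, chosen only for illustration.
train_idx, test_idx = split_60_40(n_samples=20)
```

The two index arrays are disjoint and together cover every sample, so no schema is used for both training and testing.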
IV. METRICS FOR MULTIDIMENSIONAL MODEL
UNDERSTANDABILITY
This section briefly discusses the metrics proposed by Serrano et
al. [3] for data warehouse multidimensional models. The proposed
metrics are as follows:
NFT(Sc): Number of fact tables of the schema.
NDT(Sc): Number of dimension tables of the schema.
NFK(Sc): Number of foreign keys in all the fact tables of the
schema.
NMFT(Sc): Number of measurable facts in the schema.
To understand these metrics, let us consider a star
schema as given in figure 1.
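As a minimal sketch, the four metrics can be computed from a simple in-memory description of a schema. The `Table` class and `schema_metrics` function are hypothetical names introduced here, and NMFT is interpreted as the number of measure attributes in the fact tables, which is an assumption that agrees with the values reported in Table I. The table contents follow the star schema of Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    is_fact: bool
    foreign_keys: list = field(default_factory=list)
    measures: list = field(default_factory=list)  # only meaningful for fact tables


def schema_metrics(tables):
    """Compute NFT, NDT, NFK and NMFT for a list of Table objects."""
    facts = [t for t in tables if t.is_fact]
    return {
        "NFT": len(facts),
        "NDT": len(tables) - len(facts),
        "NFK": sum(len(t.foreign_keys) for t in facts),
        "NMFT": sum(len(t.measures) for t in facts),
    }


star_schema = [
    Table("Sales Fact", True,
          foreign_keys=["Prod_id", "Company_id", "Location_id", "Time_id"],
          measures=["Sales in dollars", "Quantity"]),
    Table("Company", False),
    Table("Product", False),
    Table("Location", False),
    Table("Time", False),
]

print(schema_metrics(star_schema))
# prints {'NFT': 1, 'NDT': 4, 'NFK': 4, 'NMFT': 2}
```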
Figure 1: Star schema. The fact table Sales Fact (Prod_id (FK),
Company_id (FK), Location_id (FK), Time_id (FK), Sales in dollars,
Quantity) is linked to four dimension tables: Company (Company_id,
Company name, Region name, Branch), Product (Prod_id, Prod_cat,
Prod_name, Prod_feature, Unit price), Location (Location_id, Country
name, State name, City name) and Time (Time_id, Week, Month, Year).
Fig. 2: Neural network architecture (4-15-1), with an input layer of
4 nodes, a hidden layer of 15 nodes and an output layer of 1 node.
B. Performance criteria
The performance of a trained neural network can be measured on the
basis of various parameters, e.g. sum of squared errors (SSE), mean
square error (MSE), mean absolute relative error (MARE), mean
absolute percentage error (MAPE) and root mean square error (RMSE).
In this paper, MARE is used to analyse the performance of the
network.
$$\mathrm{MARE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
where $y_i$ is the i-th value of the actual output, $\hat{y}_i$ is
the corresponding estimated output and $N$ is the number of
observations.
In this work, we have used the MATLAB NN toolbox functions.
VI. EXPERIMENT
In this work, the effect of the proposed metrics on the
understandability of data warehouse schemas is demonstrated. Hence,
the output variable is understandability. Understandability is also
a sub-characteristic of maintainability (ISO 2001). Boehm defined
understandability as a characteristic of software quality that
denotes the ease of understanding software systems [17].
A. Data collection
We applied the ANN technique to the data set available in Serrano et
al. [3], in which the authors collected data for the metrics and
conducted an experiment to record the time required to understand
multidimensional schemas. Schemas with a higher average
understanding time are more difficult to understand (less
understandable).
B. Analysis of results
The results of training the neural network with the trainbr
algorithm are shown in Fig. 3; the sum of squared errors for
training is approximately 0.1668.
Table II shows the experimental results in terms of MARE (training
and testing) and the correlation coefficient R.
TABLE II: EXPERIMENTAL RESULTS (PERFORMANCE OF NEURAL NETWORK IN
TERMS OF MARE AND R)
                    MARE     R
For training data   0.0726   0.975
For testing data    0.0813   0.94
The regression plot showing the target and actual values as
predicted by the NN is shown in Fig. 4.
Fig. 4: Regression plot showing the target and actual values as
predicted by NN
Results of the (4-15-1) network architecture indicate that a neural
network can predict the quality of a data warehouse multidimensional
model with acceptable accuracy, as is evident from the high value of
the correlation coefficient R (more than 0.9) and the low value of
MARE for both training and testing data.
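The two measures used in this section to judge the network, MARE and the correlation coefficient R, can be sketched as below. The function names are hypothetical, R is assumed to be the Pearson correlation between targets and predictions, and the sample values are made up for illustration; they are not the data from [3].

```python
import numpy as np


def mare(actual, predicted):
    """Mean absolute relative error; actual values must be non-zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))


def correlation_r(actual, predicted):
    """Pearson correlation coefficient between targets and predictions."""
    return float(np.corrcoef(actual, predicted)[0, 1])


# Made-up values for illustration only; these are not the paper's data.
actual = [10.0, 20.0, 40.0]
predicted = [11.0, 18.0, 40.0]
print(mare(actual, predicted))
print(correlation_r(actual, predicted))
```

A MARE close to 0 together with an R close to 1 indicates that the predictions track the measured values closely. Note that MARE is undefined when an actual value is zero, which is not an issue for strictly positive understanding times.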
VII. CONCLUSION
Empirical validation plays an important role in building cumulative
knowledge from which useful measurement conclusions can be drawn and
applied in practice. In this work, an NN technique is used to
estimate the understandability of a data warehouse multidimensional
model. The NN proposed in this work predicts targets that closely
match the measured experimental values. However, we feel that more
studies of this type must be carried out with larger data sets to
generalize and validate the results.
Fig.3: Results of training function
REFERENCES
[1] Jeusfeld M. A., Quix C., Jarke M. (1998), "Design and analysis of quality information for data warehouses", Intl. Conference on Conceptual Modelling (ER 1998).
[2] Calero C., Piattini M., Pascual C., Serrano M. A., "Towards data warehouse quality metrics", 3rd International Workshop on Design and Management of Data Warehouses (DMDW 2001), Interlaken, Switzerland.
[3] Serrano M. A., Calero C., Sahraoui H. A., "Empirical studies to assess the understandability of data warehouse schemas using structural metrics", Software Quality Journal 16(1), pp. 79-106, 2008.
[4] Bouzeghoub M., Fabret F., Galhardas H., "Data warehouse refreshment", Chapter 4 in Fundamentals of Data Warehouses.
[5] Serrano M., Trujillo J., Calero C., Piattini M., "Metrics for data warehouse conceptual models understandability", Information and Software Technology (INFSOF), Elsevier, Vol. 49(8), pp. 851-870, 2007.
[6] "Software product evaluation - quality characteristics and guidelines for their use", ISO/IEC Standard 9126.
[7] Inmon W. H. (1997), "Building the Data Warehouse", John Wiley & Sons.
[8] Kimball R., Reeves L., Ross M., Thornthwaite W. (1998), "The Data Warehouse Lifecycle Toolkit", John Wiley & Sons.
[9] Moody D. L. (2005), "Theoretical and practical issues in evaluating the quality of conceptual models: current state and future directions", Data and Knowledge Engineering, Vol. 55, Issue 3, pp. 243-276.
[10] Si-Said S., Prat N. (2003), "Multidimensional Schemas Quality: Assessing and Balancing Analyzability and Simplicity", ER Workshops 2003, Springer LNCS 2814, pp. 140-151.
[11] Fenton N., Pfleeger S. (1996), "Software Metrics: A Rigorous Approach", Chapman & Hall, London.
[12] Serrano M. A., Calero C., Piattini M., "Empirical Validation of Multidimensional Data Model Metrics", IEEE Hawaii International Conference on System Sciences, 2002.
[13] Thwin M. T., Quah T.-S., "Application of neural networks for software quality prediction using object-oriented metrics", Journal of Systems and Software, Vol. 76, Issue 2 (May 2005), pp. 147-156.
[14] Calero C., Piattini M., Genero M., "Metrics for controlling database complexity", Chapter III in Developing Quality Complex Database Systems: Practices, Techniques and Technologies, Becker (ed.), Idea Group Publishing, 2001.
[15] Serrano M. A., Calero C., Trujillo J., Lujan S., Piattini M., "Empirical Validation of Metrics for Conceptual Models of Data Warehouses", Lecture Notes in Computer Science 3084 (CAiSE 2004), pp. 506-520.
[16] Serrano M., Calero C., Piattini M., "Validating metrics for data warehouses", IEE Proceedings Software (in association with BCS), Vol. 149, No. 5, pp. 161-166, 2002.
[17] Boehm B. W. (1978), "Characteristics of Software Quality", North Holland.
[18] Khoshgoftaar T., Allen E. D., Hudepohl J. P., Aud S. J., "Application of neural networks to software quality modeling of a very large telecommunications system", IEEE Transactions on Neural Networks, 8(4), pp. 902-909, 1997.
[19] Kanmani S., Rhymend Uthariaraj, Sankaranarayana, Thambidurai P., "Object oriented software fault prediction", Information and Software Technology, Vol. 49(5), pp. 483-492, 2007.
[20] Jarke M., Vassiliou Y., "Data Warehouse Quality: A review of the DWQ Project", in Conference on Information Quality, Massachusetts Institute of Technology, Cambridge.