Analyzing Empirical Data in Software Engineering

Li Jiang, School of Computer Science, The University of Adelaide, SA 5000, Australia
Armin Eberlein, Computer Engineering Department, American University of Sharjah, UAE
Aneesh Krishna, Curtin University of Technology, Perth, WA 6102, Australia

Abstract: Getting meaningful information from empirical data is a challenging task in software engineering (SE) research. It requires an in-depth analysis of the problem data and structure in order to select the most suitable data analysis methods, as well as an evaluation of the validity of the analysis results. This paper reports experiences with three data analysis methods that were used to analyze a set of empirical data. One of the major findings is that although each method has its own value, none of them is sufficient to address all challenges on its own. The research reveals that meaningful analysis results can only be obtained if several data analysis methods are combined.

Keywords: Requirements Engineering, Software Engineering, Requirements Engineering Techniques, Data Analysis Methods, Clustering.

I. INTRODUCTION

The development of large and medium-sized software systems usually involves complex processes that make use of several development techniques. Since the term "software engineering (SE)" was first coined in 1968 at the first SE conference, numerous SE techniques have been proposed. However, early experience has shown that there is no silver bullet for software engineering problems [1]. Therefore, most software development processes employ a combination of techniques [2]. Furthermore, several researchers have emphasized the importance of selecting and using suitable software engineering techniques to tackle problems in software development [3-8]. Nevertheless, it is not trivial to assess the suitability of an SE technique within the context of a software project, as many techniques are available and numerous factors influence decision making. We therefore started a research project focusing on the analysis of the suitability of requirements engineering (RE) techniques for a software project based on its characteristics [9]. This resulted in methodologies(1) and a framework that can help select the most suitable RE techniques for a software project. The aim now is to broaden the framework to the entire software development process, i.e., to develop methods for the project-specific selection of SE techniques. This is possible as RE techniques are a subset of SE techniques, i.e., they possess similar knowledge elements and structure [9]. However, extending our previous framework from the selection of RE techniques to the selection of SE techniques is very challenging, as the number of SE techniques is much larger than that of RE techniques. Thus, finding good methods for analyzing information about SE techniques is the first problem we need to solve in order to extend the methodologies that we developed and used before to the selection of SE techniques.

(1) We acknowledge the differences between the two terms "method" and "technique" as used in the SE research community, and the disparities of the definitions given for these two terms in academia. The term "method" is deliberately used in this paper to refer to any one or more algorithms and/or methods created for data clustering and data analysis. The purpose of adopting this terminology (in this paper only) is to differentiate the two terms "method" and "technique", with the latter referring to SE techniques or methods.
In our previous research, we have:

- analyzed 46 RE techniques in depth. These RE techniques are the most often used, well-documented and mature techniques [2]; they are listed in Appendix 1.
- developed a set of attributes that help characterize RE techniques [9]. The list of technique attributes is given in Appendix 2. The attributes are classified into two categories: Category 1 includes 13 essential attributes that are generally applicable to all RE techniques; the attributes in Category 2 are supplementary to those in Category 1 and provide additional information about the suitability of RE techniques.
- conducted a survey among 8 RE experts from both industry and academia to elicit a set of empirical data about the abilities of the 46 RE techniques in dealing with practical problems [9]. The empirical data was sanitized and validated against the research results published by others. The obtained data set is shown in Appendixes 3A and 3B. As shown there, each technique is characterized by 31 attributes (31-dimensional information), i.e., each technique is a multidimensional data point.
- analyzed RE techniques using the Fuzzy C-means (FCM) method [10-12]. The basic idea of the method is illustrated in Figure 1. Partial clustering results are given in Table 1, which includes the number of clusters and the values of the cost function of the algorithm obtained in the research (see [13] for more information about the research results). Moreover, several new concepts and relationships between the RE techniques have also been identified in that research, such as comparable techniques and complementary techniques [13].

Table 1. Clustering result with the FCM algorithm

Number of clusters | Cost (all weights Wi = 1) | Cost (various weights)
4  | 27.12 | 12.96
5  | 25.43 | 10.65
6  | 21.78 | 6.18
7  | 18.81 | 5.11
8  | 17.55 | 3.60
9  | 17.92 | 3.64
10 | 18.12 | 5.50
11 | 19.55 | 5.87
12 | 19.38 | 6.81

Notes: The "weight" refers to the weight of each attribute of the techniques; "various weights" indicates that the attributes were assigned different weights based on the characteristics of the project.

Fig. 1. Modified FCM algorithm:

Initialization: choose a stopping value epsilon, a fuzzification coefficient a (a = 2 is used in this research), and an initial partition matrix satisfying 0 <= u_ik <= 1 (i = 1, 2, ..., c; k = 1, 2, ..., N) and sum_{i=1}^{c} u_ik = 1 for each k.

For n = 2 to c:
  Repeat:
    update the centroids:   m_i = ( sum_{j=1}^{n} u_ij^a X_j ) / ( sum_{j=1}^{n} u_ij^a )
    update the memberships: u_ij = 1 / sum_{k=1}^{p} ( d_ij / d_kj )^( 2/(a-1) )
  Until max_{ik} | u_ik(iteration+1) - u_ik(iteration) | < epsilon
  Calculate Cost = sum_{i=1}^{p} sum_{j=1}^{n} u_ij^a d_ij^2
End for

where j = 1, ..., n and n is the number of objects; i = 1, ..., p and p is the number of clusters; m_i is a vector representing the centroid of cluster i; d_ij = ||X_j - m_i|| is the distance between object X_j and the cluster centroid m_i; u_ij is the degree of membership of object j in cluster i; and Cost is the cost function calculated for each clustering trial.

Our initial data analysis of RE techniques using FCM provided information about the similarities and differences of RE techniques.
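To make the clustering step concrete, the following is a minimal sketch of the FCM procedure of Figure 1 in Python with NumPy. It is an illustration only, not the C++/Matlab implementation used in the research; the random input data stands in for the empirical data of Appendixes 3A and 3B, and all function and variable names are ours.

```python
import numpy as np

def fuzzy_c_means(X, c, a=2.0, eps=1e-5, max_iter=300, seed=0):
    """Minimal FCM following the structure of Figure 1.

    X: (N, d) data matrix, one row per technique; c: number of clusters;
    a: fuzzification coefficient (a = 2 in the paper); eps: stopping value.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)   # columns of the partition matrix sum to 1
    for _ in range(max_iter):
        Um = U ** a
        M = (Um @ X) / Um.sum(axis=1, keepdims=True)   # centroids m_i
        D = np.linalg.norm(M[:, None, :] - X[None, :, :], axis=2)
        D = np.fmax(D, 1e-12)                          # guard against d_ij = 0
        # membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(a-1))
        U_new = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2.0 / (a - 1))).sum(axis=1)
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    cost = float((U ** a * D ** 2).sum())  # Cost = sum_ij u_ij^a d_ij^2
    return U, M, cost

# Toy run: 46 hypothetical techniques with 31 attributes on the [0, 1] scale.
X = np.random.default_rng(0).random((46, 31))
for c in range(4, 13):
    print(c, round(fuzzy_c_means(X, c)[2], 2))
```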
However, the detailed analysis of the clustering result revealed that a number of issues remained unanswered. For instance, we do not know how good the clustering result obtained with FCM is, and the fit of some techniques to their cluster is questionable based on an in-depth analysis of the techniques within the cluster. Additionally, finding the right number of clusters is a tedious process, as the clustering has to be repeated many times. Furthermore, the traditional FCM is a local search algorithm that looks for local minimum values of membership with regard to a set of selected centroids of all data elements. However, support for RE technique selection requires dynamically generated information about the similarities between RE techniques. The difficulty of finding the optimum number of clusters of techniques prevents us from providing dynamic support for technique analysis and selection in the methodology that we developed for RE technique selection, because clustering analysis is one of the early essential steps for identifying the relationships between RE techniques in that methodology. Thus, one of the challenging questions in supporting SE technique selection is how to find effective methods to analyze SE techniques data, in order to facilitate a better understanding of the techniques and the relationships between them, and to use the best method to cluster and analyze SE techniques. To tackle this problem, we systematically investigated the existing research on data analysis and mining of software engineering data. Our investigation has shown that the existing research on clustering SE data does not provide comparative or empirical analysis information, or heuristics, indicating which clustering techniques can generate the number of clusters automatically or with limited human intervention. To deal with this problem, we explored and used three methods to cluster and analyze RE techniques in this research: clustering based on statistical tests [14], a genetic algorithm [21], and dimension reduction (Principal Component Analysis) [23] in combination with the FCM method. The objective of this research is to understand whether existing clustering methods and other data analysis methods can be used together to offer meaningful help for data clustering and analysis. This paper reports the results and experiences obtained in this research. The rest of the paper is organized as follows: Section 2 discusses related research; Section 3 presents the clustering method based on statistical tests; Section 4 presents our experiences of using a genetic algorithm to cluster RE techniques; Section 5 presents the clustering of RE techniques using a combination of a dimension reduction technique and FCM; and Section 6 presents our conclusions and future research.

II. RELATED RESEARCH

The work related to analysing software engineering data can be traced back to the 1950s, when information about "lines of code" was analysed [29]. The first formal classification and analysis of software engineering data, however, can be credited to the seminal work done by Khoshgoftaar and Allen [30] and by Mendonca and Sunderhaft [31] in 1999. Khoshgoftaar and Allen used the classification and regression trees (CART) algorithm to model various software quality attributes, whilst Mendonca and Sunderhaft surveyed the existing approaches that can be used in mining software engineering data.
Since then, much research has been done on mining software engineering data and on using various data mining and data analysis techniques to analyse it. For example, Zhong et al. investigated the k-means and Neural-Gas clustering algorithms for clustering-based analysis of software quality estimation problems [32]. Jiang et al. investigated multiple centroid-based unsupervised clustering algorithms for network intrusion detection [33]. Dickinson et al. used a clustering approach to analyze execution profiles to help find failure data [34]. The research most closely related to our analysis of clustering algorithms is that of Baraldi and Blonda [35], in which five fuzzy clustering algorithms for pattern recognition are discussed and compared. However, their comparison and analysis are fundamentally based on theoretical models, and the data models are largely limited to homogeneous data sets. Even though many clustering methods have been used for analysing software engineering data during the last ten years, most of the existing research focuses on presenting clustering results obtained by designing or using a set of clustering algorithms. The advantages and disadvantages of these algorithms, and the question of how to use them to analyze heterogeneous data sets, have not been explicitly researched or discussed. Limited research has systematically examined the merit of a clustering technique in a specific application where the number of attributes is large and the data is fuzzy in nature. This research tries to explore the merits and issues of clustering techniques applied in such an application.

III. CLUSTERING BASED ON STATISTICAL TEST

As discussed in Section 1, one of the major problems with FCM is that the optimum number of clusters cannot be determined before the actual clustering begins; it has to be determined by trial and error using repeated clustering. To solve this problem, a Statistical Test-based Clustering (STC) method was proposed by Gao et al. [14]. The idea of the approach is to conduct statistical tests in each cluster formed in a trial, so that a unimodal distribution is achieved in all trial clusters. According to the experimental results of Gao et al. [14], M, the number of randomly selected origins, needs to exceed 10 to ensure fast convergence and generality of the algorithm. However, there are only 46 techniques (data points) in the current data set, which means that we would get fewer than 5 (46/10 = 4.6) data points in each cluster. Clearly, 5 data points per cluster are not enough to generate a normal distribution, which is a fundamental requirement for applying the algorithm. To apply the algorithm, we therefore need to extend our original data set. According to Lehmann and Casella [15], it is possible to construct an extended data set that is statistically completely equivalent to the original data set X. Thus, we injected 184 new data points (4 times the number of original data points) into the sampled space, which increases the total to 230 data points. These new data points are called "unlabelled data points", while the original data points are called "labelled data points", which allows us to differentiate the original points from the inserted ones. The unlabelled data points are inserted in a semi-random manner: the data are generated randomly, with the same type and within the same range as the original data (as shown in Appendixes 3A and 3B).
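The generation procedure is only sketched in the text above, but the following Python fragment shows one way to inject such unlabelled points, under the assumption that each synthetic value is drawn from the values already observed in the corresponding attribute column (which keeps the type, range and empirical distribution of the original data). This is our own minimal rendering of that idea, not the exact procedure used in the research.

```python
import numpy as np

def inject_unlabelled(X, factor=4, seed=1):
    """Extend a labelled data set with semi-randomly generated points.

    Every synthetic attribute value is resampled from the corresponding
    column of X, so each new point has the same type and lies within the
    same range as the original data.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    synth = np.column_stack([rng.choice(X[:, j], size=factor * n) for j in range(d)])
    labelled = np.r_[np.ones(n, dtype=bool), np.zeros(factor * n, dtype=bool)]
    return np.vstack([X, synth]), labelled

# 46 original points with 31 attributes on the {0, 0.2, ..., 1} scale -> 230 points.
X = np.random.default_rng(1).integers(0, 6, size=(46, 31)) / 5.0
X_ext, labelled = inject_unlabelled(X)
print(X_ext.shape, int(labelled.sum()))   # (230, 31) 46
```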
The inserted data can be considered a kind of perturbed data. Unlike the usual objective of using perturbed data, which is to protect the confidentiality of data [16], the objective of inserting data in this research is to find patterns and structures within the original data set. According to Burridge [17], the sufficiency property of the perturbed data set with respect to the given statistical model must be the same as that of the original data set. Given the original data and the perturbed data generated in this research, it is possible to infer that the extended data set and the original data set X have the same sufficient statistic(2) [36], because the inserted data have the same types, range and distribution as the original ones. Thus, we can conclude that the statistical analysis of the original data set will yield the same result as that of the extended data set obtained as described above.

(2) A sufficient statistic is a statistic that has the property of sufficiency with respect to a statistical model and its associated unknown parameter θ [36], i.e., no other statistic calculated from the same data set provides any additional information about the value of θ.

By adding perturbed data in this way, we were able to use the modified Statistical Test-based Clustering method and obtain the proper number of clusters for the original data set of RE techniques. The modified algorithm is shown in Figure 2; it was implemented using C++ and the Matlab™ software package. Using this algorithm and the extended data set as discussed above, we found that each cluster reached a single-peak distribution when NL (the number of clusters) = 8. This value of NL is similar to the values we obtained in [13]. Our initial analysis with the STC algorithm is promising, as the algorithm has a major advantage over FCM: it provides a direct way to find the exact number of clusters that we look for when analyzing RE techniques. However, the current result is still subject to further experiments in which larger and heterogeneous data sets of SE techniques will be used. As our knowledge of SE techniques increases, the number of available data sets of SE techniques will likely increase; consequently, further application and improvement of the Statistical Test-based Clustering method are likely to produce better analysis results.

Fig. 2. Algorithm for computing the number of clusters:

1. Set the initial values used in the algorithm:
   (1) Randomly place M sampling origins, represented by the vector C, in the data space X = {x_1, ..., x_n}, where C = (1/n) sum_{i=1}^{n} x_i and n is the number of elements in X.
   (2) Set cn = 2, where cn is the initial number of clusters.
   (3) Calculate the distance D_i between each sample x_i and the vector C, and the distance U_i between x_i and x_{i+1}, where x_i and x_{i+1} are perpendicular.
   (4) Let k be the number of neighbours of x_i that are going to be searched, and p the dimension of the data set.
   (5) Select α = 0.05; the corresponding threshold T(α) = 1.64485 is easily found in the statistical table.
   (6) Set s = 0.
2. For each cluster, compute the following normalized T statistic, where K is the number of the statistical test and p is the dimension of the data set x_i:
   T_K = ( (1/M) sum_{i=1}^{M} D_i^p(k) / ( D_i^p(k) + (U_i(k)/2)^p ) - 1/2 ) · sqrt(12M)
3. If T_K > T(α), then s = s + 1.
4. Repeat steps 2 and 3 N times (N >= 100), then calculate the size of the test ŝ = s/N. If ŝ > α, the data set X is declared to have a multimodal distribution and to be separable; otherwise, X is declared unimodal.
5. For all i: if every data set X_i ⊆ X is unimodal, then stop; cn is the number of clusters sought. Otherwise, set cn = cn + 1 and go to step 2.
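For illustration, the sketch below computes the normalized statistic of step 2 as reconstructed in Figure 2 and applies the one-sided test at α = 0.05. The exact formula in Gao et al. [14] may differ in detail, and the toy distances are placeholders, so treat this as a sketch of the idea rather than a faithful reimplementation.

```python
import numpy as np

def stc_statistic(D, U, p):
    """Normalized test statistic from Figure 2 (as reconstructed here).

    D: distances from a sampled origin to the samples of one cluster;
    U: distances between neighbouring samples; p: data dimension.
    If the ratios r_i = D_i^p / (D_i^p + (U_i/2)^p) behave like Uniform(0, 1)
    draws (mean 1/2, variance 1/12), then sqrt(12 M) * (mean(r) - 1/2) is
    approximately standard normal.
    """
    M = len(D)
    r = D ** p / (D ** p + (U / 2.0) ** p)
    return np.sqrt(12 * M) * (r.mean() - 0.5)

def rejects_unimodality(D, U, p, t_alpha=1.64485):
    """One test repetition: compare T_K with T(alpha) for alpha = 0.05."""
    return stc_statistic(D, U, p) > t_alpha

# Toy distances for one cluster in a 6-dimensional space.
rng = np.random.default_rng(2)
D, U = rng.random(20) + 0.1, rng.random(20) + 0.1
print(stc_statistic(D, U, p=6), rejects_unimodality(D, U, p=6))
```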
IV. USING GENETIC ALGORITHM IN FUZZY CLUSTERING

Based on the mechanism of natural selection and genetics, genetic algorithms (GAs) have been designed and widely used for many optimization problems [18, 19]. GAs have also been used in clustering algorithms [20]. In this research, we want to explore whether GAs can be used to generate better clustering results or to provide more information about the SE techniques data. After comparing different GAs used in clustering methods, the GA for Fuzzy Clustering (GAFC) algorithm proposed by Zhao et al. [21] was chosen to cluster the RE techniques in this research, as its computational complexity is not very high and it reaches better convergence than other existing similar algorithms. According to Lee and Takagi, the number of generations and the mutation probability can be set within the ranges of 10 to 160 and 0.0001 to 1.0, respectively [27]. In our experiments, we found that with the number of generations set above 15 and the mutation probability above 0.1, the algorithm achieved somewhat faster convergence than we had expected. We finally set the number of generations to 16 and the mutation probability to 0.10. C++ and several Matlab™ packages were used during the implementation of the GAFC algorithm.
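Zhao's GAFC algorithm is not reproduced here, but the sketch below shows the general shape of GA-based fuzzy clustering: each chromosome encodes a candidate set of centroids, fitness is the FCM cost induced by those centroids, and truncation selection, uniform crossover and gene-wise mutation evolve the population. The population size and the specific operators are our assumptions for illustration, not the design of [21].

```python
import numpy as np

def fcm_cost(centroids, X, a=2.0):
    """FCM objective for fixed centroids: memberships from the closed-form
    update, then cost = sum_ij u_ij^a d_ij^2."""
    D = np.linalg.norm(centroids[:, None, :] - X[None, :, :], axis=2)
    D = np.fmax(D, 1e-12)
    U = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2.0 / (a - 1))).sum(axis=1)
    return float((U ** a * D ** 2).sum())

def ga_fuzzy_clustering(X, c, pop=30, generations=16, p_mut=0.10, seed=3):
    """Genetic search over centroid sets; chromosome = c x d real genes."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    lo, hi = X.min(axis=0), X.max(axis=0)
    P = rng.uniform(lo, hi, size=(pop, c, d))          # initial population
    for _ in range(generations):
        costs = np.array([fcm_cost(ind, X) for ind in P])
        elite = P[np.argsort(costs)[: pop // 2]]       # truncation selection
        children = []
        while len(children) < pop - len(elite):
            i, j = rng.integers(0, len(elite), size=2)
            mask = rng.random((c, d)) < 0.5            # uniform crossover
            child = np.where(mask, elite[i], elite[j])
            mut = rng.random((c, d)) < p_mut           # gene-wise mutation
            child = np.where(mut, rng.uniform(lo, hi, size=(c, d)), child)
            children.append(child)
        P = np.concatenate([elite, np.array(children)])
    costs = np.array([fcm_cost(ind, X) for ind in P])
    return P[int(np.argmin(costs))], float(costs.min())

# Toy run on 13 attributes, mirroring the Category 1 reduction.
X = np.random.default_rng(3).random((46, 13))
for c in range(2, 7):
    print(c, round(ga_fuzzy_clustering(X, c)[1], 2))
```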
The GAFC algorithm proved to be computationally very expensive when we tried to use all 31 attributes (dimensions) for clustering. After running the calculation for two weeks on a PC with a 1.8 GHz CPU, we were only able to obtain clustering results for 3 and 4 clusters. To improve the efficiency of the algorithm, we decided to reduce the number of attributes used in the clustering. Instead of using all 31 attributes, the 13 attributes in Category 1 (see Appendix 2) were selected; these attributes were rated as highly important RE technique attributes by RE experts in our previous research. With the reduced number of attributes, the number of clusters and the values of the cost function calculated using the GAFC algorithm are shown in Table 2, together with the values of the cost function calculated with the FCM algorithm.

Table 2. GAFC clustering results compared with FCM clustering results

Number of clusters | Cost (GA-Fuzzy Clustering) | Cost (Fuzzy C-Means)
2 | 12.76 | 18.89
3 | 9.32  | 16.23
4 | 7.86  | 12.96
5 | 6.72  | 10.65
6 | 5.91  | 6.18
7 | 5.01  | 5.11
8 | 4.05  | 3.60
9 | 3.29  | 3.64

As can be seen from the table, the GAFC algorithm converged more quickly than FCM in the calculation of the cost function while the number of clusters was below 6. However, the performance of the GAFC algorithm is worse than that of FCM when the number of clusters is greater than 6. Moreover, it is hard to reach a reasonable decision with GAFC about the exact number of clusters that should be used for analysing RE techniques, as the values of its cost function decrease continuously even after the number of clusters passes 9. Our initial investigation suggests two likely reasons for this phenomenon: the limitations of the algorithm itself, and the characteristics of the data points with 13 attributes, i.e., the algorithm might not be suitable for higher-dimensional (more than 13-dimensional) data. Our further investigation with 6 attributes, randomly selected from the Category 1 data set, shows that the cost function of the GAFC algorithm again converges quickly and reaches its minimum when the number of clusters reaches 12, which differs from the number of clusters (8 or 9) generated with the FCM algorithm in our earlier research. The major reason for the difference is that essential information about the RE techniques is lost when attributes (each of which can be considered one dimension of a technique) are randomly removed. Thus, we conclude that: (1) the performance of GAs is best when they are applied to low-dimensional data, i.e., data with fewer than 6 dimensions; and (2) it is not appropriate to use the GAFC algorithm after randomly removing a number of dimensions from the given data set.

As a result of this observation, the immediate question is how to reduce the dimensions of the data points whilst keeping their essential information, so that the full potential of GAs can be achieved. To tackle this issue, we utilize dimension reduction methods, which are discussed in the next section. Improving the GAFC algorithm itself is part of our future research.

V. CLUSTERING BASED ON DIMENSION REDUCTION

As discussed above, one of the major challenges in conducting effective data analysis is that the data points contain too many attributes (dimensions); many data analysis and clustering algorithms cannot deal with such multi-dimensional data effectively. One solution to this problem is to reduce the dimensions of the data points while keeping the essential information of the original data. Dimension reduction methods have been widely used in computer vision and pattern recognition research and have proved effective for analysing high-dimensional data [22-25]. The major objective of dimension reduction is to search for a property-preserving low-dimensional representation of the higher-dimensional data, i.e., to map the high-dimensional space to a lower-dimensional space in such a way that the required properties are preserved. For example, a dimension reduction algorithm can map a data set {D_1(a_1, a_2, ..., a_i, a_{i+1}, ..., a_n), ..., D_n(a_1, a_2, ..., a_i, a_{i+1}, ..., a_n)} that contains n attributes to a data set {D'_1(a_1, a_2, ..., a_k), ..., D'_n(a_1, a_2, ..., a_k)}, where each D'_i(a_1, ..., a_k) contains the essential information of D_i(a_1, ..., a_n). Most often, the dimension reduction problem is formulated as an optimization problem in which the required properties are quantified by an objective function. Applying dimension reduction techniques to software engineering data makes perfect sense, as software engineering data usually have high dimensionality [28]. There are many dimension reduction methods available, such as Principal Component Analysis [23], Projection Pursuit [24], and Principal Curves [25].
In our research, we use the Principal Component Analysis (PCA) method, as it is the most widely used method in practice and is suitable for the size and type of data that we want to process. Moreover, PCA is implemented in one of the Matlab™ software packages (princomp), which can be used directly. The fundamental idea of PCA is to project high-dimensional data along the dimensions with maximal variances, so that the reconstruction error of the low-dimensional data points is minimized and the properties of the data points are maximally preserved. In this research, we also used a modified PCA algorithm in which a weight W_j ∈ [0, 1], j = 1, ..., m, is applied to each attribute, based on the importance of each attribute for RE techniques as judged by requirements engineers. This algorithm was also implemented in C++ and is illustrated in Figure 3. We used both princomp in Matlab™ and our implementation of the modified algorithm in our experiments in this research.

Fig. 3. A modified PCA algorithm:

1. Prepare the initial data set:
   (1) Initialize the data set X_{i,j} and the weights W_j (i = 1, ..., n; j = 1, ..., m), and construct the original matrix U_{i,j}.
   (2) For j = 1 to m, calculate X̄_j = (1/n) sum_{i=1}^{n} X_{i,j} · W_j.
   (3) Generate the new adjusted matrix U^{new}_{i,j} = U_{i,j} - X̄_j.
2. Calculate:
   (1) the covariance matrix of U^{new}_{i,j};
   (2) the eigenvectors and eigenvalues (eig_1, eig_2, ..., eig_m) of the covariance matrix from step 2(1);
   (3) the feature vector FeatureVector = (eig_1, eig_2, ..., eig_k), formed by choosing the k (k <= m) components selected for the projection of the original data in U_{i,j};
   (4) the new data set FinalDataSet = FeatureVector × U^{new}_{i,j}.
Note: W_j ∈ [0, 1], j = 1, ..., m, is the weight given to each attribute.

Using these algorithms, we successfully reduced the 31-dimensional data points to 6-dimensional data points. Some examples of the 6-dimensional data points obtained with princomp are presented in Table 3; the entire list of the generated data with reduced dimensions is given in Appendix 4.

Table 3. An example of the generated dataset after dimension reduction

Technique | D1 | D2 | D3 | D4 | D5 | D6
T1 | -1.892500 | 0.079340 | -0.206770 | 0.030650 | -0.419870 | -0.069110
T2 | -1.630000 | 0.103430 | -0.414710 | 0.304470 | -0.663370 | -0.106870
T3 | -1.168700 | 0.379490 | -0.075140 | -0.112600 | -0.558350 | 0.117600
T4 | -1.914800 | -0.322170 | -0.503340 | 0.045320 | -0.617720 | 0.004040
T5 | -1.672900 | -0.824850 | -0.532970 | 0.094020 | 0.177100 | -0.117170
T6 | -1.915200 | -0.281250 | -0.304210 | 0.018030 | -0.334170 | 0.214430
T7 | -1.658000 | -0.170510 | -0.246680 | 0.195570 | -0.345710 | 0.303790
T8 | -0.779230 | -0.704200 | 0.290550 | -0.309960 | -0.197310 | 0.635260
Notes: (1) D_i represents dimension i. (2) See Appendix 1 for the name of T_i.
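The following is a minimal Python rendering of the weighted PCA of Figure 3: the attribute weights are applied first, the weighted data is mean-centred, and the data is projected onto the k leading eigenvectors of the covariance matrix, analogous to Matlab's princomp pipeline. The uniform weights in the toy run stand in for the expert-judged attribute weights used in the research.

```python
import numpy as np

def weighted_pca(U, W, k):
    """Modified PCA of Figure 3.

    U: (n, m) matrix of techniques x attributes; W: (m,) attribute weights
    in [0, 1]; k: number of principal components to keep (k <= m).
    """
    Uw = U * W                                     # step 1: apply attribute weights
    Uc = Uw - Uw.mean(axis=0)                      # mean-centre each attribute
    C = np.cov(Uc, rowvar=False)                   # m x m covariance matrix
    vals, vecs = np.linalg.eigh(C)                 # eigenvalues and eigenvectors
    feature = vecs[:, np.argsort(vals)[::-1][:k]]  # k leading eigenvectors
    return Uc @ feature                            # (n, k) reduced data set

# Toy run: reduce 46 x 31 data to 6 dimensions with uniform weights.
U = np.random.default_rng(4).random((46, 31))
print(weighted_pca(U, np.ones(31), k=6).shape)   # (46, 6)
```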
The data generated by the dimension reduction operation (see Appendix 4) was clustered using the FCM algorithm. The results are shown in Table 4, in which a is the fuzzification coefficient. Normally a is set to 2; we set a to 1.5 and 2 respectively in this research to compare the convergence of the FCM algorithm before and after using dimension reduction. As can be seen, there are two choices for the number of clusters: 8 and 9. This is essentially the same number of clusters as obtained in our previous research where only FCM was used. The major gain of using PCA is that it reduces the complexity of using FCM in the later clustering stage. The major problem with PCA is that it is still hard to tell the exact number of clusters that should be chosen, i.e., this number has to be determined by humans: as Table 4 shows, both NL = 8 and NL = 9 are valid options. Improving the algorithm further, or combining PCA with GAFC, is a subject of future research.

Table 4. Clustering result

Number of clusters | Cost (a = 1.5) | Cost (a = 2)
NL = 2  | 0.049800  | 0.074970
NL = 3  | -0.004067 | 0.023239
NL = 4  | -0.011247 | 0.000535
NL = 5  | -0.009762 | 0.002730
NL = 6  | -0.001504 | 0.030802
NL = 7  | -0.000941 | 0.020111
NL = 8  | -0.013643 | -0.010694
NL = 9  | -0.013643 | -0.012545
NL = 10 | -0.009591 | 0.002380
Note: a is the fuzzification coefficient.

VI. CONCLUSION AND FUTURE WORK

Analysis of empirical data is a difficult yet important task in SE [26]. This paper reported our experiments with three data analysis methods and algorithms applied to the empirical data obtained in our previous research: clustering based on statistical tests, a genetic algorithm, and clustering using dimension reduction. We presented the issues and demonstrated possible ways to deal with the determination of the number of clusters and the reduction of dimensions for effective clustering. The research has shown that the best solution for analysing empirical data is to combine different data analysis methods. This combined approach might be a time-consuming and daunting process; however, it is the only way to discover meaningful information and the underlying structure of the data. Moreover, with further research and the combination of more clustering methods, it is possible to reduce the effort needed to analyse SE data. At this stage, it is safe to say that the STC algorithm is promising, as it provides a way to find the exact number of clusters sought, even though the validity of this conclusion is still subject to further experiments with larger and heterogeneous data sets of SE techniques. Additionally, based on the experience obtained in this research, a combination of PCA and FCM can provide better data analysis results. Finally, it is still difficult to say whether the combination of PCA, GA and FCM will lead to better results for clustering analysis; this question is subject to our future research. The design and development of a tool that facilitates the use of different data analysis methods for analysing SE data is another topic for our future research.

REFERENCES

[1] Brooks F.: No Silver Bullet: Essence and Accident in Software Engineering, IEEE Computer, 20(4), 10-19 (1987)
[2] Jiang L., Eberlein A., Far B.H., Mousavi M.: A Methodology for the Selection of Requirements Engineering Techniques, Journal of Software and Systems Modeling, 7(3), 303-328 (2008)
[3] Glass R.L.: Matching Methodology to Problem Domain, Comm. of the ACM, 47(5), 19-21 (2004)
[4] Basili V.R.: The Role of Experimentation in Software Engineering: Past, Current, and Future, Proc. 18th Int. Conference on Software Engineering, Berlin, Germany, pp. 442-449 (1996)
[5] Emam K.E., Birk A.: Validating the ISO/IEC 15504 Measure of Software Requirements Analysis Process Capability, IEEE Trans. on Software Engineering, 26(6), 119-149 (2000)
[6] Zowghi D., Damian D., Offen R.: Field Studies of Requirements Engineering in a Multi-Site Software Development Organization, Proc. Australian Workshop on Requirements Engineering, Univ. of New South Wales (2001)
[7] Neill C.J., Laplante P.A.: Requirements Engineering: The State of the Practice, IEEE Software, 20(6), 40-45 (2003)
[8] Antón A.I.: Successful Software Projects Need Requirements Planning, IEEE Software, 20(3), 44-46 (2003)
[9] Jiang L.: A Framework for Requirements Engineering Process Development, PhD Thesis, University of Calgary, Canada (2005)
[10] Dunn J.: A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact, Well Separated Clusters, Journal of Cybernetics, 3(3), 32-57 (1974)
[11] Bezdek J.C.: Cluster Validity with Fuzzy Sets, Journal of Cybernetics, 3(3), 58-71 (1974)
[12] Bezdek J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press (1981)
[13] Jiang L., Eberlein A.: Clustering Requirements Engineering Techniques, Proc. 10th IASTED International Conference on Software Engineering and Applications, Dallas, Texas, USA, Nov. 13-15 (2006)
[14] Gao X.B., Ji H.B., Li J.: An Advanced Cluster Analysis Method Based on Statistical Test, Proc. IEEE ICSP, pp. 1100-1103 (2002)
[15] Lehmann E.L., Casella G.: Theory of Point Estimation, Springer-Verlag, New York (1998)
[16] Liu K., Kargupta H., Ryan J.: Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining, IEEE Transactions on Knowledge and Data Engineering, 18(1), 92-106 (2006)
[17] Burridge J.: Information Preserving Statistical Obfuscation, Statistics and Computing, 13(4), 321-327 (2003)
[18] Gen M., Cheng R.: Genetic Algorithms and Engineering Design, Wiley, New York (1997)
[19] Chambers L.D.: The Practical Handbook of Genetic Algorithms Applications, Chapman & Hall/CRC (2001)
[20] Cordon O.: Ten Years of Genetic Fuzzy Systems: Current Framework and New Trends, Proc. Joint 9th IFSA World Congress and 20th NAFIPS International Conference, p. 1241 (2001)
[21] Zhao L., Tsujimura Y., Gen M.: Genetic Algorithm for Fuzzy Clustering, Proc. IEEE International Conference on Evolutionary Computation, p. 716 (1996)
[22] Carreira-Perpinan M.A.: A Review of Dimension Reduction Techniques, Technical Report CS-96-09, Department of Computer Science, University of Sheffield (1997)
[23] Jolliffe I.T.: Principal Component Analysis, Springer Series in Statistics, Springer-Verlag, Berlin (1986)
[24] Jones M.C.: The Projection Pursuit Algorithm for Exploratory Data Analysis, PhD Thesis, University of Bath (1983)
[25] Hastie T.J., Stuetzle W.: Principal Curves, Journal of the American Statistical Association, 84, 502-516 (1989)
[26] Shin M., Goel A.L.: Empirical Data Modeling in Software Engineering Using Radial Basis Functions, IEEE Trans. on Software Engineering, 26(6), 567 (2000)
[27] Lee M.A., Takagi H.: Dynamic Control of Genetic Algorithms Using Fuzzy Logic Techniques, Proc. Int. Conf. on Genetic Algorithms, Urbana-Champaign, IL, pp. 76-83 (1993)
[28] Goel A.L., Shin M.: Software Engineering Data Analysis Techniques (tutorial), Proc. 19th International Conference on Software Engineering, Boston, Massachusetts, USA, pp. 667-668 (1997)
[29] Jones C.: Applied Software Measurement: Global Analysis of Productivity and Quality, 3rd edn., McGraw-Hill (2008)
[30] Khoshgoftaar T.M., Allen E.B.: Modeling Software Quality with Classification Trees, in: Pham H. (ed.), Recent Advances in Reliability and Quality Engineering, World Scientific, Singapore (1999)
[31] Mendonca M., Sunderhaft N.L.: Mining Software Engineering Data: A Survey, DACS State-of-the-Art Report, Data & Analysis Center for Software, Rome, NY (1999)
[32] Zhong S., Khoshgoftaar T.M., Seliya N.: Analyzing Software Measurement Data with Clustering Techniques, IEEE Intelligent Systems, 19(2), 20-27 (2004)
[33] Jiang S.Y., Song X.Y., Wang H., et al.: A Clustering-Based Method for Unsupervised Intrusion Detections, Pattern Recognition Letters, 27(7), 802-810 (2006)
[34] Dickinson W., Leon D., Podgurski A.: Finding Failures by Cluster Analysis of Execution Profiles, Proc. Int. Conf. on Software Engineering (2001)
[35] Baraldi A., Blonda P.: A Survey of Fuzzy Clustering Algorithms for Pattern Recognition, Part I, IEEE Transactions on Systems, Man, and Cybernetics, Part B, 29(6), 778-785 (1999)
[36] Hogg R.V., Craig A.T.: Introduction to Mathematical Statistics, Macmillan (1978)
Appendix 1. The List of RE Techniques and Notations

T1: Brain Storming and Idea Reduction
T2: Designer As Apprentice
T3: Document Mining
T4: Ethnography
T5: Focus Group
T6: Interview
T7: Contextual Inquiry
T8: Laddering
T9: Viewpoints-Oriented Elicitation
T10: Exploratory Prototype
T11: Evolutionary Prototypes
T12: Viewpoints-Oriented Analysis
T13: Repertory Grids
T14: Scenario Approach
T15: JAD
T16: The Soft Systems Methodology (SSM)
T17: Goal-Oriented Analysis
T18: Viewpoints-Based Definition
T19: Future Workshops
T20: Representation Modeling
T21: Functional Decomposition
T22: Decision Tables
T23: State Machine
T24: State Charts
T25: Petri-Nets
T26: Structured Analysis (SA)
T27: Real-Time Structured Analysis
T28: Object-Oriented Analysis
T29: Problem Frame Oriented Analysis
T30: Goal-Oriented Verification and Validation
T31: Entity Relationship Diagrams
T32: AHP
T33: Card Sorting
T34: Software QFD
T35: Fault Tree Analysis
T36: Structured Natural Language Specification
T37: Viewpoints-Oriented Verification and Validation
T38: Unified Modeling Language (UML)
T39: Z
T40: LOTOS
T41: SDL
T42: XP
T43: Formal Requirements Inspection
T44: Requirements Testing
T45: Requirements Checklists
T46: Utility Test

Appendix 2.
RE Techniques Attributes & Notations

A1: Ability to facilitate the communication (category 2)
A2: Ability to understand social issues (category 2)
A3: Ability to get domain knowledge (category 1)
A4: Ability to get implicit knowledge (category 1)
A5: Ability to identify stakeholders (category 1)
A6: Ability to identify non-functional requirements (category 1)
A7: Ability to identify various viewpoints (category 2)
A8: Ability to model and understand requirements (category 1)
A9: Ability to analyze non-functional requirements (category 2)
A10: Ability to facilitate the negotiation with customer (category 1)
A11: Ability to prioritize the requirements (category 2)
A12: Ability to support COTS-based RE process (category 2)
A13: Ability to model interface requirements (category 2)
A14: Ability to identify the accessibility of the system (category 1)
A15: Ability to identify and support requirements reuse (category 2)
A16: Ability to represent requirements (expressibility) (category 2)
A17: Capability for requirements verification (category 1)
A18: Completeness of the semantics of the notation (category 2)
A19: Ability to write unambiguous and precise requirements by using the notation (category 1)
A20: Ability to write complete requirements (category 1)
A21: Capability for requirements management (category 1)
A22: Modularity (category 2)
A23: Implementability (executability) (category 2)
A24: Understanding ability for the notations used in analysis (category 2)
A25: Ability to identify the ambiguous requirements (ambiguity, inconsistency, conflict) (category 1)
A26: Ability to identify the interaction between requirements (category 2)
A27: Ability to identify the incomplete requirements (category 2)
A28: Maturity of the supporting tool (category 1)
A29: Learning curve (introduction cost) (category 2)
A30: Application cost (category 2)
A31: Complexity of the techniques (category 2)

Appendix 3A. RE Techniques Assessment (Empirical) Data (1)

Values of attributes A1–A10 for each technique:

T1: 0.8 0.4 1 0.2 1 1 0.8 0 0 0.6
T2: 0.8 1 1 1 0.2 1 0.2 0 0 0
T3: 0 0.8 1 0.2 0.2 0.8 0.4 0 0 0
T4: 0.6 0.8 1 1 0.6 0.4 0.4 0 0 0
T5: 1 1 0.6 0.4 1 1 0.8 0 0 0
T6: 0.8 0.8 0.6 0.2 1 1 0.8 0 0 0
T7: 1 1 0.6 0.2 1 0.6 0.6 0 0 0
T8: 0.6 0.6 0.6 0.2 1 0.6 0.6 0 0 0
T9: 0.8 1 0.8 0.6 1 0.8 1 0.6 0 0
T10: 0.8 0.2 0.4 0.2 0 0 0 0.8 0 0
T11: 0.4 0 0 0 0 0 0 1 0.8 0
T12: 0 0 0 0 0 0 0 0.8 0.8 0.6
T13: 1 0.6 0.6 0.6 0.6 0.2 0.4 0 0 0
T14: 0.8 0.6 0.4 0.2 0.4 0.2 0.8 1 1 0.2
T15: 1 1 0.6 0.2 1 0.8 1 0 0 0
T16: 1 1 0.6 0.2 1 0.4 0.6 0 0 0.6
T17: 0 0 0 0 0 0 0 0.8 0.8 0.6
T18: 0 0 0 0 0 0 0 0 0 0
T19: 1 1 0.6 0.2 1 0.8 1 0.8 0 0.6
T20: 0.8 0 0 0.2 0 0 0.2 1 1 0
T21: 0 0 0 0 0 0 0 0.8 1 0.2
T22: 0 0 0 0 0 0 0 1 1 0
T23: 0 0 0 0 0 0 0 1 0.6 0

Values of attributes A11–A31, listed for T1–T23 in order:

A11: 0 0 0 0 0 0 0 0 0 0.8 0.2 0.8 0 0.4 0 0 0.8 0 0.4 0.4 0.4 0.4 0.4
A12: 0 0 0 0 0 0 0 0 0 0 0 0.6 0 0.4 0 0 0.4 0 0.6 0.2 0.2 0 0
A13: 0 0 0 0 0 0 0 0 0 0.8 0.8 0.4 0 0.8 0 0 0.2 0 0.6 0.6 0.6 0.6 0.8
A14: 0 0 0 0 0 0 0 0 0 1 1 0.4 0 0.6 0 0 0.4 0 0.4 1 0.2 0 0
A15: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.2 0 0 0 0 0 0
A16: 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0.8 0 1 0.8 1 1
A17: 0 0 0 0 0 0 0 0 0 0 0.8 0 0 0.6 0 0 1 0.6 0.6 0.6 0.2 0.4 0.8
A18: 0 0 0 0 0 0 0 0 0 0 0 0 0 0.6 0 0 0.8 1 0 0.6 0.6 0.2 0.6
A19: 0 0 0 0 0 0 0 0 0 0 0 0 0 0.8 0 0 1 0.8 0 1 0.6 1 1
A20: 0 0 0 0 0 0 0 0 0 0 0.4 0 0 0.6 0 0.4 0.8 0.8 0.6 0.6 0.6 0.6 0.6
A21: 0 0 0 0 0 0 0 0 0 0 0 0 0 0.6 0 0 0.8 0.8 0 0.8 0.6 0.8 1
A22: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.8 0.2 0 0 0.6 0 0
A23: 0 0 0 0 0 0 0 0 0 0 0 0 0 0.4 0 0 1 0.2 0 0 0 0 0
A24: 0 0 0 0 0 0 0 0 0 0 0 0 0 0.4 0 0 1 0 0 0 0 0.2 0.6
A25: 0 0 0 0 0 0 0 0 0 0 0.4 0 0 0.4 0 0 0.8 0 0 0.2 0 0.4 0.6
A26: 0 0 0 0 0 0 0 0 0.8 0 0.2 0 0 0.2 0 0.2 0.6 0 0.8 0 0 0 0
A27: 0 0 0 0 0 0 0 0 0 0 0 0 0 0.2 0 0 0 0 0 0 0 0 0
A28: 0 0 0 0 0 0 0 0.4 0.8 0.8 0.6 0 1 0.6 0.4 0 0.6 0.8 0 0.8 0.8 0.8 0.6
A29: 0.2 0.2 0.2 0.4 0.6 0.2 0.2 0.2 0.4 0.2 0.4 0.2 0.4 0.4 0.6 0.6 0.8 0.4 0.4 0.2 0.4 0.4 0.6
A30: 0.6 0.6 0.4 0.6 0.6 0.4 0.4 0.4 0.4 1 0.4 0.2 0.4 0.4 0.6 0.6 0.6 0.4 0.6 0.2 0.2 0.2 0.6
A31: 0.2 0.2 0.2 0.4 0.6 0.2 0.2 0.2 0.2 0.4 0.4 0.2 0.4 0.4 0.2 0.6 0.8 0.2 0.4 0.2 0.2 0.4 0.6

Legend: 1. See Appendix 1 for the technique name that each T_j represents. 2. See Appendix 2 for the attribute name that each A_i represents. 3. Each value represents the degree to which a technique satisfies an attribute.

Appendix 3B. RE Techniques Assessment (Empirical) Data (2)

Values of attributes A1–A31, listed for T24–T46 in order:

A1: 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0.4 0 0 0 1 0 0 0 0
A2: 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0.6 0 0 0 0
A3: 0 0 0 0 0 0 0 0 0 0 0.4 0 0 0 0 0 0 0 0.4 0 0 0 0
A4: 0 0 0 0 0 0 0 0 0 0 0.2 0 0 0 0 0 0 0 0 0 0 0 0
A5: 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0.4 0 0 0 0
A6: 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0.2 0 0 0 0
A7: 0 0 0 0 0 0 0 0 0 0 0.8 0 0 0 0 0 0 0 0 0 0 0 0
A8: 1 1 1 1 1 1 0 1 0 0 0 0.8 0.6 0 1 1 1 1 1 0 0 0 0
A9: 0.8 0.2 1 0.8 0.8 1 0 1 0 0 0 1 1 0 0.8 0.4 0.4 0.4 0.8 0 0 0 0
A10: 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
A11: 0.4 0 0.4 0.4 0.4 0.2 0 0.2 0.6 0.2 0.8 0.2 0.2 0 0.8 0 0 0 1 0 0 0 0
A12: 0 0 0.2 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0.8 0 0 0 0
A13: 0.6 0.8 0.8 0.8 0.8 0.6 0 0.8 0 0 0.6 0.6 0.6 0 0.6 1 1 1 0.2 0 1 1 1
A14: 0.6 0 0 0 0.6 0.6 0 0 0 0 0.6 0 0 0 1 0 0 0 0.8 0 0 0 1
A15: 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
A16: 1 1 1 1 1 1 0 1 0 0 0 1 1 0 1 1 1 1 0.8 0 0 0 0
A17: 0.8 0.8 0.4 0.4 0.4 0.4 1 0.6 0 0 0.4 0 0 0.4 0.6 1 1 1 0.4 0.4 1 0.8 0.8
A18: 0.6 0.8 0.6 0.6 0.6 0 0 0.6 0 0 0 0.6 0.6 0.4 0.8 1 1 1 0 0 0 0 0
A19: 0.8 1 1 1 0.8 0.6 0 0.8 0 0 0 0.6 0.6 0 0.8 1 1 1 0.4 0 0 0 0
A20: 0.6 0.6 0.6 0.6 0.6 1 0 0.8 0 0 0.8 0.6 0.6 0 0.8 0.6 0.6 0.6 0.4 0.6 0.6 0.6 0.6
A21: 0 1 1 1 0.8 0.8 0 0.8 0 0 0 0.6 0.6 0 0.8 1 1 0.4 0.8 0.8 0.8 0.8 0.8
A22: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.8 0.8 0.8 0.8 0 0 0 0 0
A23: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.6 1 1 1 0 0 0 0 0
A24: 0.6 0.6 0 0 0 0 0.8 0 0 0 0 0 0 0.8 0.6 1 1 1 0.4 0.4 0 0 0
A25: 0.2 0.6 0.4 0.4 0.4 0.2 0.8 0.4 0 0 0 0 0.2 0.4 0.8 1 1 1 0.4 0.4 1 1 1
A26: 0 0 0 0 0 0.8 0.6 0.2 0 0 0.8 0.2 0.2 0.8 0.4 0 0 0 0 0.8 0.8 0.8 0.8
A27: 0 0 0 0 0 0 0 0 1 1 1 0 0 0.4 0 0 0 0 0 0 0 0 0
A28: 0.6 0.8 0.8 0.6 0.8 0 0.6 0.8 1 0 0.8 0.8 0.8 0.4 1 0.8 0.8 0.8 0.6 0 0 0 0
A29: 0.6 0.8 0.4 0.6 0.6 0.4 0.8 0.4 0.6 0.2 0.8 0.2 0.2 0.8 0.6 1 1 1 0.4 0.6 0.2 0.2 0.2
A30: 0.6 0.8 0.6 0.6 0.4 0.4 0.6 0.4 0.4 0.4 0.4 0.2 0.2 0.8 0.4 0.8 0.8 0.8 0.4 0.4 0.6 0.6 0.6
A31: 0.4 0.8 0.4 0.6 0.2 0.4 0.8 0.4 0.4 0.2 0.8 0.2 0.2 0.8 0.6 1 1 1 0.4 0.6 0.2 0.2 0.2

Legend: 1. See Appendix 1 for the technique name that each T_j represents. 2. See Appendix 2 for the attribute name that each A_i represents.

Appendix 4.
An Example of Generated Dataset after Dimension Reduction Techniques

Values of each generated dimension, listed for T1–T46 in order:

D1: -1.892500 -1.630000 -1.168700 -1.914800 -1.672900 -1.915200 -1.658000 -0.779230 -1.739400 -0.237480 -0.016696 -0.245810 -0.881600 0.418380 -1.210400 -1.706900 1.574500 0.795750 -1.737100 0.929870 0.839770 1.185200 1.420900 1.366700 1.447800 1.304900 1.290600 1.245400 0.867110 -0.007875 1.306100 -0.517170 -0.686850 -1.888500 0.720930 0.708600 -0.194780 1.630800 1.946700 1.946700 1.785700 0.054956 0.047944 0.047944 0.022047 0.025468

D2: 0.079340 0.103430 0.379490 -0.322170 -0.824850 -0.281250 -0.170510 -0.704200 -0.267080 -0.742650 -0.653690 0.289370 -0.435800 -1.243800 -1.325600 0.066620 -0.358890 0.607680 -0.693110 -0.824860 -0.197750 -0.240130 -0.122170 -0.307620 0.015373 -0.332190 -0.232910 -0.420970 -0.200102 1.392200 -0.344460 0.985160 1.276400 -0.749610 -0.257850 -0.152110 1.315200 -0.573940 -0.116060 -0.116060 -0.104740 -0.089951 1.332400 1.332400 1.330100 1.145700

D3: -0.206770 -0.414710 -0.075140 -0.503340 -0.532970 -0.304210 -0.246680 0.290550 -0.079758 0.577910 0.191440 1.181900 0.344940 -0.055321 -0.576650 -0.191640 -0.900520 0.112670 -0.552530 0.653110 1.017000 0.573550 -0.191520 -0.257750 0.507080 0.460200 0.270880 0.596130 0.585730 -0.579310 0.462860 0.920430 0.891440 -0.445390 0.940860 0.864600 -0.117810 -0.111530 -1.360400 -1.360400 -1.313500 0.375560 -0.530030 -0.520030 -0.450630 -0.397900

D4: 0.030650 0.304470 -0.112600 0.045320 0.094020 0.018030 0.195570 -0.309960 -0.332310 0.380390 0.623020 0.237940 -0.266430 0.169370 0.388760 -0.372020 -0.407790 -0.698890 0.363110 0.347560 -0.078768 0.090900 0.079606 -0.019063 -0.169040 0.100120 0.095809 0.435600 1.129800 -0.263990 0.425450 -0.943720 -0.498060 -0.356510 -0.086485 -0.064430 -0.552300 0.254390 -0.488250 -0.488250 -0.617980 0.136670 1.088900 1.088900 1.552000 1.338500

D5: -0.419870 -0.663370 -0.558350 -0.617720 0.177100 -0.334170 -0.345710 -0.197310 -0.234040 0.135690 0.262950 0.801180 -0.320780 0.050422 0.434210 -0.319680 0.725390 -0.738230 0.218520 -0.026437 -0.242140 -0.178770 -0.142130 -0.084030 -0.276290 -0.276920 -0.313150 0.019795 0.120070 0.640890 -0.253730 0.940190 0.653350 0.868440 -0.430780 -0.422720 0.629180 0.628220 -0.004545 0.004545 0.089288 0.790240 -0.096070 -0.059607 -0.058295 0.273890

D6: -0.069110 -0.106870 0.117600 0.004040 -0.117170 0.214430 0.303790 0.635260 -0.823150 -0.899700 -0.710950 -0.039679 -0.319430 0.072067 -0.037322 -0.354170 0.279240 0.387110 0.005525 -0.033600 0.009505 -0.047044 0.055687 -0.012172 -0.027596 0.227060 0.118010 0.014297 -0.117510 -0.586240 0.097974 0.480050 0.420860 0.959770 0.234140 0.296120 -0.749110 -0.013977 -0.075070 -0.075070 -0.277230 -0.087457 0.323940 0.323940 0.293840 0.176250

Notes: (1) D_i represents the generated new dimension i. (2) See Appendix 1 for the name of T_i.