Automatic Data Categorization in Issue Tracking Systems: A Research Preview

Thorsten Merten¹, Barbara Paech²
¹ Bonn-Rhein-Sieg University of Applied Sciences, Department of Computer Science, Germany, [email protected], [email protected], http://www.brsu.de, http://www.inf.h-brs.de
² University of Heidelberg, Software Engineering Group, Germany, [email protected], http://www.uni-heidelberg.de

ITS DATA MINING
• Most data in an Issue Tracking System (ITS) is stored in user-defined text fields
• Data mining approaches try to extract and use this data
• Categorization of ITS data can support multiple software engineering (SWE) tasks, e.g. (cf. right figure)
  ◦ creating a new sub-issue if necessary
  ◦ marking or changing wrongly entered manual ITS data
  ◦ marking relevant information for a certain task
  ◦ removing unnecessary information

ISSUE TRACKING SYSTEM DATA
• Issue Tracking Systems contain valuable software engineering information, e.g.
  ◦ requirement and feature descriptions
  ◦ development and refactoring tasks
  ◦ bug reports and bug-fixing tasks
  ◦ discussions and dissent
  ◦ ...
• This information fits in multiple categories
  ◦ implementation ideas
  ◦ stack traces or error messages
  ◦ social interaction
  ◦ ...

OUR APPROACH

AN EXAMPLE FOR ALGORITHMIC IMPROVEMENTS
• Goal: to support the automatic creation of change logs
• Some prerequisites can be used
  ◦ The separation of technical items and natural language text already works for developer mailing list classification [1]
  ◦ Semi-supervised learning can be used to separate relevant change log information, but precision and recall are generally low
• Preprocessing is very important in text mining [2, p. 84]
• We can use the specialized algorithm that separates technical items as a preprocessing step for general classification [4, 5]

CURRENT AND FUTURE WORK
• Literature reviews on algorithms and solutions and on ITS-supported SWE tasks
• Identification of ITS-supported SWE tasks
• Identification of data categories to support these tasks
• Use of rule-based and specialized algorithms for preprocessing
• General idea: improvement of semi-supervised learning by combining multiple methods
• Experimental validation of algorithms and measurement of improvements on ITS tasks

AN EXAMPLE FOR TASK-BASED UTILIZATION
Identify: Documentation is a SWE task. The creation of a change log is a subdiscipline of documentation that can be supported by ITS data.
Hypothesize: We can improve the categorization of task-relevant and task-irrelevant natural language text passages by separating out technical information.
Categorize: We implement different algorithms to separate technical and natural language information and combine them (as preprocessing) with different semi-supervised learning algorithms.
Evaluate: We apply the above categorization to create change logs and evaluate the result by a) comparing it with manually written change logs and b) surveying developers. We may then go back to the Categorize step and improve the method.

REFERENCES
[1] A. Bacchelli, T. Dal Sasso, M. D’Ambros, and M. Lanza. Content classification of development emails. In 2012 34th International Conference on Software Engineering (ICSE), pages 375–385, June 2012.
[2] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, 3rd edition, 2011.
[3] S. T. March and G. F. Smith. Design and natural science research on information technology. Decision Support Systems, 15(4):251–266, 1995.
[4] R. E. Vlas and W. N. Robinson. Two Rule-Based Natural Language Strategies for Requirements Discovery and Classification in Open-Source Software Development Projects. Journal of Management Information Systems, pages 1–39, 2012.
[5] T. Zhang and B. Lee. A Bug Rule Based Technique with Feedback for Classifying Bug Reports. In 2011 IEEE 11th International Conference on Computer and Information Technology, pages 336–343. IEEE, Aug. 2011.
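
ILLUSTRATIVE SKETCH: RULE-BASED PREPROCESSING
The preprocessing step named under AN EXAMPLE FOR ALGORITHMIC IMPROVEMENTS, separating technical items from natural language text, could look roughly like the following Python sketch. It is not the algorithm of [1], [4], or [5]: the pattern list, the function names is_technical and separate, and the sample issue text are all hypothetical and chosen only for illustration. The natural language lines it returns would then be passed on to a general (e.g. semi-supervised) classifier, which is not shown here.

```python
import re
from typing import List, Tuple

# Hypothetical rule set (an assumption for illustration, not taken from [1], [4], or [5]).
# A line that matches any of these patterns is treated as a "technical item".
TECHNICAL_PATTERNS = [
    re.compile(r"^\s*at\s+[\w.$]+\(.*\)\s*$"),                # Java stack frame
    re.compile(r"^\s*Traceback \(most recent call last\):"),  # Python traceback header
    re.compile(r"^\s*File \".+\", line \d+"),                 # Python traceback frame
    re.compile(r"\b\w+(?:\.\w+)*(?:Exception|Error)\b"),      # exception or error class name
    re.compile(r"[;{}]\s*$"),                                 # line that ends like source code
]


def is_technical(line: str) -> bool:
    """Return True if at least one rule matches the line."""
    return any(pattern.search(line) for pattern in TECHNICAL_PATTERNS)


def separate(text: str) -> Tuple[List[str], List[str]]:
    """Split issue text into (technical lines, natural language lines)."""
    technical: List[str] = []
    natural: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no information for either category
        target = technical if is_technical(line) else natural
        target.append(line.strip())
    return technical, natural


# Made-up issue text mixing prose with a stack trace.
issue = """The export dialog crashes when I click Save.
java.lang.NullPointerException: name is null
    at com.example.Export.save(Export.java:42)
Could we also add a CSV option?"""

technical, natural = separate(issue)
print("technical:", technical)
print("natural:", natural)
```

Line-level rules of this kind are deliberately simple. In practice the rule set would need to be derived from and validated against real ITS data, and precision and recall would have to be measured before and after applying the preprocessing.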