Toward Mining “Concept Keywords” from Identifiers in Large Software Projects
Masaru Ohba and Katsuhiko Gondow
Tokyo Institute of Technology

What are “concept keywords”?
• Most programmers try to name identifiers meaningfully.
• Concept keywords are defined terms that describe key concepts to aid program understanding.
  – e.g. read_dirent(): dirent is a concept keyword.

Human-selected concept keywords and other category words in udos
  Concept keywords:                      dirent, root, PTE, tss, path, signal, yield
  Grouping words:                        kbd_, vga_, FAT12_, sys_, H, t
  Attributes, less important concepts:   busy, byte, offset, name, memory, end, int8, again
  Generic verbs:                         read, set, is, move, wait, print, dump, make, init

Suggestion
• We should use more “concept keywords” in program understanding tools.
  – Concept keywords are concise and descriptive.
• Our solution:
  – provides a way to mine concept keywords (ckTF/IDF methods / Identifier Exploratory Framework)
  – could be used to build tools that support and utilize extracted concept keywords (future work)

Future work
• Applying concept keywords to a Bug Tracking System (BTS) to see the relationship between a bug report and the corresponding problem source code.
[Figure: bug report no. 1 ("It could not read directories.") is linked by the keyword dirent to read_dirent() { return NULL; } in fat12.c; bug report no. 3 ("I could not catch system calls.") is linked by the keyword signal to sys_signal() { sys_kill(); } in task.c.]
Concept keywords can bridge the gap between bug reports and source code.


Source code that talks: an exploration of Eclipse task comments and their implication to repository mining
Annie Ying (joint work with Jim Wright & Steve Abrams)
IBM Watson Research Center
© 2005 IBM Corporation

In a software development task...
[Figure: development artifacts (reqs, change reports, emails, source code) carry task-oriented info and communication; the source code example is class Foo containing the comment "// Joan, please fix this" and a method void m1().]

Empirical study on Eclipse task comments
Eclipse task comments:
  // TODO an ugly hack for now –sue. Joan, please fix it
  // TODO eliminate this once ECR 317 complete

Conclusion
• Presented observations on uses of comments
  – e.g., task-oriented info and communication
• Take-home message:
  – When mining software repositories, consider analyzing comments.

The End

Challenges in analyzing Eclipse task comments
• Challenges: informality, implied context, fuzzy scope
• Eclipse task comments:
  // TODO an ugly hack for now –sue. Joan, please fix it
  // TODO eliminate this once ECR 317 complete
  // TODO explain why this method is public
  // TODO once we have Eclipse-icon-decorator mechanism, use it here
  // TODO workaround for ... ... // End workaround


Text Mining for Software Engineering: How Analyst Feedback Impacts Final Results
Jane Huffman Hayes, Alex Dekhtyar, Senthil Karthikeyan Sundaram
Department of Computer Science, University of Kentucky
*Funded by NASA

Question of the Day
What can Data Mining do for Software Engineering?
• Answer 1: Help study the process
  – After-the-fact
  – Exploratory
  – Conclusions help future projects
• Answer 2: Help improve the process
• Our approach: use Data Mining during the process

Use Mining During the Process?
[Figure: feedback loop connecting Task, Automated “Mining” Tool, Analyst, and Final Result.]
• Ultimately, we are interested in the accuracy of the final result.
• Objective study (RE'04, PROMISE'05)
• Subjective study

Preliminary Study
Question: What would the analyst do with machine-generated data?
• Task: Requirements Tracing
• Metrics: Precision, Recall

  Candidate link list     Analyst's final result     Change
  Pr      Rec             Pr      Rec                ΔPr      ΔRec
  40%     60%             45%     56%                +5%      -4%
  20%     90%             58%     65%                +38%     -25%
  80%     30%             23%     27%                -57%     -2%

[Figure: recall versus precision, both axes 0 to 100; legend entries T1, T3, T4, and "From RE2003 reg". Trend???]

(Not Quite) Conclusions
• New Field of Study
• Larger Study Needed
Call for Help! WANTED! VOLUNTEERS!
Thank You!


Signature Change Analysis
Sunghun Kim, Jim Whitehead, Jennifer Bevan
{hunkim, ejw, jbevan}@cs.ucsc.edu
University of California, Santa Cruz

Biological and Software Evolution
[Figure: successive versions v1, v2, v3.]
• Can we shape the software evolution path?
  – LOC
  – Number of Changes
  – Structural Changes
  – Signature Changes

Found Signature Change properties
• The most common signature change kinds are complex data type, parameter addition, parameter ordering, and parameter deletion.
[Figure: bar chart, vertical axis 0 to 60, of signature change kinds (Parameter name change, Only ordering change, Addition, Deletion, Modifier change, Array/Pointer, Complex type name change, Primitive type change) for A 1.3, A2, APR, APU, CVS, GCC, SVN, and AVG.]
• More than half of function signatures never change.
• About 90% of function signatures change fewer than three times.
• A function’s signature changes after every 5-15 function body changes.
• A project’s average number of parameters per function remains relatively constant over time.
• Functions typically have parameter lists with 1, 2, or 3 parameters.
• Weak correlations between signature change and other changes, including LOC and function body changes.
• Each project has its own signature change patterns, and the pattern can be discovered after analyzing the first 1000 to 1500 revisions.
[Figure: two bar charts, SVN and A 1.3, vertical axis 0 to 60, showing each change kind (Parameter name change, Only ordering changes, Addition, Deletion, Modifier change, Complex type name change) after 100, 200, 300, 500, 1000, 1500, 2000, and 5000 revisions; the last series in the two charts are 6029 and 7747.]
• Probability of a change kind depends on previous changes.
[Figure: trees of transition probabilities between change kinds labeled A, D, O, and C, for (a) APR and (b) Apache 2.]

Future Work
• Signature change analysis on OOP (Java)
  – The results presented here are based on open source projects written in a procedural programming language (C): Apache HTTP 1.3, Apache HTTP 2.0, Apache Portable Runtime, APR utility, CVS, GCC, and Subversion
  – Find OOP signature change properties and compare them with those from a procedural language
• Changes inside Struct/Class
  – Variable addition/deletion
  – Variable renaming
  – Method addition/deletion


Linear Predictive Coding and Cepstrum coefficients for mining time variant information from software repositories
G. Antoniol, F. Rollo and G. Venturi
RCOST – University of Sannio - Italy

LPC Idea
• Model a time series with a polynomial approximation
• LPC Cepstrum: smooth the spectrum
• Define the distance between two time series as the distance between their polynomial approximations
• Use the distance to cluster time series with identical or similar evolutions

LPC and Linux Kernel
• 211 Linux releases, about 1700 files
• Study the influence of the number of coefficients
• Study the influence of distance thresholds
• Mine files with similar evolution: create groups of files with the same or very similar size evolution
[Figure: "Similar pairs for different thresholds and coefficients used": number of similar pairs (logarithmic axis, 100 to 10000) for thresholds 1E-3, 1E-4, 1E-5 and for 12, 16, 20, 32 coefficients.]
[Figure: "Similar pair of evolving files": vertical axis 0 to 800, horizontal axis 1 to 248.]


Complementing Each Other: GQM & DMAIC
GQM (Goal-Question-Metric)
DMAIC (Define-Measure-Analyze-Improve-Control)
• CMM sometimes criticized for emphasizing repeatability over improving productivity.
• Six Sigma sometimes criticized as inappropriate for processes characterized by knowledge efforts.
• GQM strong in defining metrics appropriate to business goals and nature of the process.
• DMAIC strong in focus on continuous iterative process improvement.

CMM+6σ Process Improvement Cycle
[Figure: cycle through Define, Measure, Analyze, Improve, Control, annotated with: Baselines, Weaknesses, Opportunities; Hypotheses, Trends, Indicators, Causes; Progress, Defects, Delays, Dissatisfactions; Collect Data; Assess; Requirements, Activities, Changes, Time, Results.]

Areas of Concern
• Architecture
  – Design weaknesses
  – General or for new demands
• Bottlenecks
  – Areas for focused attention
• Causal Connections
  – System view of process
  – Root cause analysis


Mining Version Histories to Verify the Learning Process of LPP
Shih-Kung Huang, Kang-min Liu
National Chiao Tung University
• Mining the Boundary of Openness of an Open Source Software Project
• Explore if we can apply Open Source Development (OSD) Process to Proprietary Software
• Show the Boundary of Openness during OSD

Method
• Team Members
  – Core = Relatively Important Developers
  – NonCore = All − Core
• Source Code
  – Kernel = All − NonKernel
  – NonKernel = {d | d is touched by one of the NonCore}
• Project Characteristic function
  – f(x) = {y | y is the kernel ratio with respect to the core ratio of x}
  – Kernel Ratio = (Kernel Size) / All
  – Core Ratio = (Core Team Size) / All
[Figure: panels for gallery, phpmyadmin, moodle, GCC, Slashcode, Pugs.]

Conclusions
• Obtain the characteristic function of each project team
  – Reveal different team constitutions with varied involvement in the software
• An implication: develop a hybrid software process model to embed OSD into commercial software.
  – OpenDarwin: Mac OS X
  – Helix: Real Network Server


Towards a Taxonomy of Approaches for Mining of Source Code Repositories
Huzefa H. Kagdi, Michael L. Collard, Jonathan I. Maletic
Software Development Laboratory <SDML>
Department of Computer Science, Kent State University, Kent Ohio, USA

Motivation
• A number of approaches have been proposed to derive and express changes from source code repositories in a more source-code “aware” manner
• We need better insight into the current research in the MSR community in order to facilitate building efficient and effective MSR tools

Building a Taxonomy
• Draw similarities and variations between six MSR approaches based on three dimensions
  – Entity type and granularity
  – How changes are expressed and defined
  – Type of MSR question
• Define notations to describe MSR to facilitate a taxonomic description of approaches

An Initial Taxonomy
  Approach                         Authors            Entity               Change                                      Question
  Annotation Analysis              Gall et al         class                syntax and semantic - hidden dependencies   market basket and prevalence
  Annotation Analysis              German             file & comment       syntax and semantic - file coupling         market basket and prevalence
  Heuristic                        Hassan et al       function & variable  syntax and semantic - dependencies          market basket
  Data Mining (association rule)   Zimmermann et al   class & method       syntax and semantic - association rules     market basket
  Differencing                     Raghavan et al     logical statement    syntax and semantic - move                  prevalence
  Differencing                     Collard et al      logical statement    syntax - add, delete, modify                prevalence

Conclusions
• Most of the approaches except Differencing work with fairly high-level entities
• Very different semantic information is used in these approaches
• Further investigation is necessary to discern between how changes are expressed


A Framework for Describing and Understanding Mining Tools in Software Development
D.M. German, D. Čubranić, and M.-A. Storey
University of Victoria

Introduction
• Software engineering is a collaborative activity → activity awareness is important
• Can be provided by mining software repositories
• A variety of mining tools → how to compare?
• Do we mine what is easy to mine and think about the uses for it later?

Proposal
• Develop a framework for describing tools for mining software repositories
• Purpose:
  – Help designers understand and compare tools
  – Assist users in assessing tools
  – Identify new research areas
• Keep the specific user needs and tasks in the forefront!

The Framework
• Intent
  – Role, time, cognitive support
• Information
  – Change management, program code, defect tracking
  – Informal communication, local history, correlated information
• Infrastructure
  – Requirements, offline/online, storage backend

What Next?
• Applied the framework to three tools:
  – softChange
  – Hipikat
  – Xia/Creole
• We invite researchers to apply it to their tools and give us feedback on their experiences
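
The precision and recall metrics named in the requirements tracing study by Hayes et al. earlier in this deck can be made concrete with a small sketch. It assumes a trace link is a (requirement, design element) pair and that a set of true links is known. The function name and all sample data are hypothetical and invented for illustration; none of it is data from the study.

```python
def precision_recall(candidate_links, true_links):
    """Return (precision, recall) of a candidate link list.

    precision = correct candidates / all candidates
    recall    = correct candidates / all true links
    """
    candidate = set(candidate_links)
    truth = set(true_links)
    hits = candidate & truth
    precision = len(hits) / len(candidate) if candidate else 0.0
    recall = len(hits) / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth: five real requirement-to-design links.
true_links = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R3", "D4"), ("R3", "D5")}

# Hypothetical tool output: ten candidates, four of them correct.
tool_list = {
    ("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R3", "D4"),
    ("R1", "D7"), ("R2", "D8"), ("R2", "D9"), ("R3", "D1"),
    ("R3", "D2"), ("R3", "D9"),
}

# Hypothetical analyst result: most false links removed, but one true link lost.
analyst_list = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R2", "D8")}

tool_pr, tool_rec = precision_recall(tool_list, true_links)
final_pr, final_rec = precision_recall(analyst_list, true_links)
print(f"tool:    Pr={tool_pr:.0%} Rec={tool_rec:.0%}")
print(f"analyst: Pr={final_pr:.0%} Rec={final_rec:.0%}")
print(f"change:  dPr={final_pr - tool_pr:+.0%} dRec={final_rec - tool_rec:+.0%}")
```

Comparing the pair before and after the analyst's pass gives the kind of ΔPr and ΔRec figures reported in the preliminary study table.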

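The set definitions in the Method slide by Huang and Liu (Core, NonCore, Kernel, NonKernel, and the two ratios) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: sizes are counted in files, "All" developers means everyone who appears in the history, and the core team is taken to be the top-k developers ranked by number of files touched. The slides only say "relatively important developers", so that ranking, the function names, and the sample history are all hypothetical.

```python
from collections import Counter


def ratios(touched_by, core):
    """Return (kernel_ratio, core_ratio) for one choice of core team.

    touched_by maps each file to the set of developers who touched it.
    """
    all_files = set(touched_by)
    all_devs = set().union(*touched_by.values())
    non_core = all_devs - core                      # NonCore = All - Core
    non_kernel = {f for f, devs in touched_by.items() if devs & non_core}
    kernel = all_files - non_kernel                 # Kernel = All - NonKernel
    kernel_ratio = len(kernel) / len(all_files) if all_files else 0.0
    core_ratio = len(core & all_devs) / len(all_devs) if all_devs else 0.0
    return kernel_ratio, core_ratio


def characteristic_function(touched_by):
    """Sweep the core team size and return (core_ratio, kernel_ratio) points.

    Assumption: developers are ranked by how many files they touched,
    with ties broken alphabetically.
    """
    counts = Counter(dev for devs in touched_by.values() for dev in devs)
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    points = []
    for k in range(1, len(ranked) + 1):
        kernel_ratio, core_ratio = ratios(touched_by, set(ranked[:k]))
        points.append((core_ratio, kernel_ratio))
    return points


# Hypothetical version history: file -> developers who committed to it.
history = {
    "core.c": {"ann"},
    "sched.c": {"ann", "bob"},
    "fs.c": {"bob"},
    "doc.txt": {"cy"},
    "util.c": {"ann", "cy"},
}

for core_ratio, kernel_ratio in characteristic_function(history):
    print(f"core ratio {core_ratio:.2f} -> kernel ratio {kernel_ratio:.2f}")
```

Each point pairs a core ratio with the kernel ratio it yields, which is one way to read f(x) on the slide; a real analysis might measure size in lines of code instead of files.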