A SOFTWARE ENGINEERING TOOL FOR DISTRIBUTED DEVELOPMENT
Jason Carter, Prasun Dewan
University of North Carolina at Chapel Hill

MOTIVATION
[Figure: Programmer Bob (“Grrr…”) and the programmer’s mentor/teammate (“Hmm… is Bob stuck?”)]

APPLICATIONS
[Figure: a student and a manager; caption “Need Help?”]
• Offer help to student programmers who are too shy to ask for it
• Significantly improve programmer productivity
• Benefits of this idea may occur in industry

CO-LOCATED VS. DISTRIBUTED
• Co-located team: more productivity
• Distributed team: less productivity
[2] Herbsleb, J.D., et al. Distance, dependencies, and delay in a global collaboration. In Proc. CSCW 2000.

PRODUCTIVITY HIGHER IN WAR-ROOMS THAN IN CUBICLES
[Figure: war-room vs. cubicle]
Teasley, S., et al. How does radical collocation help a team succeed? In Proc. CSCW 2000.
Combined, these studies show:

DISTANCE IMPEDES DEDUCTION
[Figure: Bob (“Hmm… is Alice stuck?”) and Alice (“Grrr…”), separated by distance]
• Developers often do not explicitly ask for help
How do we reduce this gap?

COLLABVS
• Developers are aware of the methods their distributed teammates are working on
• Use this information, together with project information, to manually determine if a teammate is stuck
• Gives distributed users the feeling of “being there” in a single location
Hegde, R. and Dewan, P. Connecting Programming Environments to Support Ad-Hoc Collaboration. In 23rd IEEE/ACM International Conference on ASE, 2008.

CAN WE DO BETTER THAN BEING THERE?
[Figure: Bob and Alice in face-to-face interaction (“Being There”) vs. Bob and Alice “Beyond Being There”]
How do we plan to go “beyond being there”?
Hollan, J. and Stornetta, S. Beyond being there. In Proc. CHI ’92.

“BEYOND BEING THERE”
You are having difficulty. (Programmer Bob)
Bob is having difficulty.
(Programmer’s Mentor/Teammate)
There are several ways to infer this information

AUTOMATIC PREDICTION OF FRUSTRATION
• Posture seating chairs
• Wireless Bluetooth skin conductance tests
• Pressure mice
• Video camera
PROBLEM: Overhead of using this non-standard equipment
Alternative approach: Determine this information by logging interaction with some component of the system
Kapoor, A., Burleson, W., and Picard, R.W. "Automatic Prediction of Frustration," International Journal of Human-Computer Studies, Vol. 65, Issue 8, 2007.

DETERMINE IF PROGRAMMERS ARE INTERRUPTIBLE
[Figure: interruptibility scale from 0 to 100]
• Developed a tool that uses developers’ actions to determine if they are interruptible
• Randomly interrupted developers
Can we use this approach?
Fogarty, J., Ko, A.J., Aung, H.H., Golden, E., Tang, K.P. and Hudson, S.E. Examining task engagement in sensor-based statistical models of human interruptibility. In Proc. CHI 2005, ACM Press (2005), 331-340.

INFORMATION ABOUT EVENTS
[Figure: interruptibility scale from 0 to 100]
• No random interruption would find that a developer is having difficulty
• A better alternative is to allow developers to report their status

USE BUTTONS TO COLLECT INFORMATION
• Buttons used to indicate status
• “Eureka Button”: to capture situations in which developers did not realize they had been having a problem until they had solved it
• “Notifications Enabled”: allowed developers to determine if they received status change notifications
• These buttons are useful only for the training phase
• Useful to run an initial naïve algorithm

OUR APPROACH
• Use Buttons to Gather Data
• Determine a Measure of Progress Using a Naive Algorithm
• Derive Mining Algorithm
• Validate our Algorithm
• Build a Model

BASIC INTUITION
“You are having difficulty.”
• Monitor the progress of developers
• Progress below some threshold indicates that they are having difficulty
[Figure: progress scale from 0 to 100 with a threshold]
• Productivity and progress are related but fundamentally different

RELATIONSHIP BETWEEN PRODUCTIVITY AND PROGRESS
• Productivity: usually measured after developers have written code
• Progress: measured while developers write code
• Little work has been done on measuring progress
• The only work we could find was done by Kersten and Murphy

MYLAR: TOOL TO REDUCE NAVIGATION
Edit Ratio = # of Edit Commands / # of Navigation Commands
Kersten, M., Murphy, G. C., Mylar: A degree-of-interest model for IDEs. In Proc. Conference on Aspect-Oriented Software Development, 2005, 159-168.

OUR APPROACH: DETERMINE MEASURE OF PROGRESS
• Use Buttons to Gather Data
• Determine a Measure of Progress Using a Naive Algorithm
• Derive Mining Algorithm
• Validate our Algorithm
• Build a Model

METRICS TO MEASURE PROGRESS
• Metrics: # of debug commands, edit ratio
[Figure labels: Low, Threshold, Stuck, Having Difficulty]

Participants           Tasks
3 freshmen             Class assignments
3 graduate students    Class and research assignments

• Naïve algorithm did not predict the progress status well
• Explore the logs and corrections to derive a better algorithm
Y. Sharon. Eclipseye—spying on eclipse. Bachelor’s thesis, University of Lugano, June 2007.
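The naive approach described above (compare a progress measure such as the edit ratio against a threshold) can be sketched as follows. This is only an illustration: the category labels, the function names, and the threshold value of 1.0 are assumptions made for the example, not values reported in the slides.

```python
from collections import Counter

def edit_ratio(commands):
    """Edit ratio = # of edit commands / # of navigation commands."""
    counts = Counter(commands)
    navigations = counts["navigation"]
    if navigations == 0:
        # No navigation in this window: avoid dividing by zero.
        return float("inf") if counts["edit"] > 0 else 0.0
    return counts["edit"] / navigations

def naive_status(commands, threshold=1.0):
    """Report difficulty when the edit ratio falls below the threshold.

    The default threshold is a hypothetical placeholder, not a tuned value.
    """
    if edit_ratio(commands) < threshold:
        return "having difficulty"
    return "making progress"

if __name__ == "__main__":
    window = ["edit", "navigation", "navigation", "navigation"]
    print(edit_ratio(window), naive_status(window))
```

A single fixed threshold like this is the kind of rule the slides report as predicting status poorly, which motivates mining the logs for better features.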
OUR APPROACH: DERIVE MINING ALGORITHM
• Use Buttons to Gather Data
• Determine a Measure of Progress Using a Naive Algorithm
• Derive Mining Algorithm
• Validate our Algorithm
• Build a Model

DERIVING MINING ALGORITHM
• Analyze logs: to find patterns when developers indicated they were having difficulty
• Features: values that change when programmers are making progress and having difficulty
• A manual inspection of the logs showed that the frequency of certain edit commands decreased when developers were having difficulty

COMMANDS GROUPED INTO FIVE CATEGORIES
Depending on the developer, the frequency of execution of other commands increased
Navigation
• Switch between files
• Search for code
Edit
• Edits program code (insertion/deletion)
Remove
• Remove a class or method
Debug
• Debug or execute code
Focus
• Eclipse loses focus
• Eclipse gains focus
We used these categories to create our features

IDENTIFYING FEATURES
For different segments of the log we calculated:
(occurrences of each category of commands in that segment / total # of commands in the segment) * 100
• Used these percentages as features to identify patterns
• The size of these segments is an important issue

DETERMINING SEGMENT SIZE
<action>
  <eventType>SOLUTION_OPENED</eventType>
  <solutionEvent>
    <timestamp>9/20/2009 12:44:02 PM</timestamp>
  </solutionEvent>
</action>
<action>
  <commandEvent>
    <command>Debug.Start</command>
    <timestamp>9/20/2009 12:45:33 PM</timestamp>
  </commandEvent>
</action>
<action>
  <eventType>WINDOW_LOST_FOCUS</eventType>
  <windowEvent>
    <timestamp>12:46:01 PM</timestamp>
  </windowEvent>
</action>
<action>
  <eventType>WINDOW_GAINED_FOCUS</eventType>
  <windowEvent>
    <timestamp>12:48:01 PM</timestamp>
  </windowEvent>
</action>
Segment sizes: Whole Log, 200, 100, 50, 25
• Graphed the programming behavior of all participants to determine usefulness of features

OUR APPROACH: VALIDATE ALGORITHM
• Use Buttons to Gather Data
• Determine a Measure of Progress Using a Naive Algorithm
• Derive Mining
  Algorithm
• Validate our Algorithm
• Build a Model

GRAPHS TO VALIDATE FEATURES
[Figures: graphs spanning three slides; not recoverable from the extracted text]
• The two graphs validate our feature choice and show that a general model must account for differences in which percentages change when developers are having difficulty
• There are several standard ways to build a general model

NUMBER OF STUCK EVENTS SIGNIFICANTLY LESS THAN TOTAL NUMBER OF EVENTS
Breakdown of events
Making Progress Events      2212
Having Difficulty Events      76
Total Events                2288
This leads to the imbalanced class distribution problem

IMBALANCED CLASS DISTRIBUTION
• Needle in a haystack
• Disproportionate number of having difficulty segments to making progress segments
• “Standard” algorithms predict making progress ~97% of the time

Actual Status        Predicted Status
Making Progress      Making Progress
Making Progress      Making Progress
Making Progress      Making Progress
Having Difficulty    Making Progress
Making Progress      Making Progress
Making Progress      Making Progress

Accuracy of this model: 83%
Problem: Model can’t identify when a developer is having difficulty

SMOTE ALGORITHM
• Replicates rare data (having difficulty) until there is more of a balance between having difficulty statuses and making progress statuses
• The replicated data of all developers were combined and used as input to several standard algorithms to build a model
[Figure: Making Progress 2212; Having Difficulty 76, increased to 1216]

BUILD A MODEL
[Figure: logs with replicated data (Participant1-2, Participant3-4, Participant5-6) are passed to the applied mining algorithms to produce a model]
• Standard 10-fold cross-validation
• 10 trials of model construction executed
• Each trial used 90% of data for training
• The remaining 10% used as test data to evaluate the model in that trial

ACCURACY OF MODEL USING DECISION TREE ALGORITHM
Decision Tree Model
                            Predicted Having Difficulty    Predicted Making Progress
Actual Having Difficulty    1101                           115
Actual Making Progress      158                            2130

Overall Accuracy: 92%
Predicted Having Difficulty: 90%
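As a worked check of the figures above, the following sketch recomputes the overall accuracy and the share of actual having difficulty cases that were caught, using the four cells of the decision tree confusion matrix. It also computes the accuracy of a baseline that always predicts making progress on the unbalanced event breakdown (2212 of 2288). The function names are illustrative, not part of the tool.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all segments whose predicted status matched the actual one."""
    return (tp + tn) / (tp + fn + fp + tn)

def recall(tp, fn):
    """Fraction of actual having-difficulty segments predicted as such."""
    return tp / (tp + fn)

# Decision tree confusion matrix from the slide:
# actual having difficulty: 1101 predicted difficulty, 115 predicted progress
# actual making progress:    158 predicted difficulty, 2130 predicted progress
tp, fn, fp, tn = 1101, 115, 158, 2130

# Baseline on the unbalanced data: always answer "making progress".
baseline_accuracy = 2212 / 2288

if __name__ == "__main__":
    print(f"overall accuracy: {accuracy(tp, fn, fp, tn):.1%}")   # 92.2%
    print(f"having difficulty caught: {recall(tp, fn):.1%}")      # 90.5%
    print(f"always-progress baseline: {baseline_accuracy:.1%}")   # 96.7%
```

The baseline scores highly on accuracy while catching no having difficulty cases at all, which is why the per-class figure is reported alongside the overall accuracy.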
Developers were having difficulty 1216 times
Developers were making progress 2288 times
Witten, I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.

CLASSIFICATION VIA CLUSTERING
• Designed to identify rare events without replicating records

Classification via Clustering Model
                            Predicted Having Difficulty    Predicted Making Progress
Actual Having Difficulty    60                             16
Actual Making Progress      896                            1392

Overall Accuracy: 67%
Predicted Having Difficulty: 79%
Developers were having difficulty 76 times
Developers were making progress 2288 times
Witten, I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.

OUR APPROACH
• Use Buttons to Gather Data
• Determine a Measure of Progress Using a Naive Algorithm
• Derive Mining Algorithm
• Validate our Algorithm
• Build a Model
Approach is promising
Left several unanswered questions

DOES APPROACH WORK IN PRACTICE?
[Figure: the features (Edit Percentage, Debug Percentage, Focus Percentage, Navigation Percentage, Remove Percentage) feed the Decision Tree Model and the Classification via Clustering Model]
• Research group and one industrial developer used the software
• Learned several important lessons

HAVING DIFFICULTY LESSON
• Stuck button and the having difficulty status hurt my advisor’s ego

FREQUENT FALSE POSITIVES LESSON
• Industry developer: workflow system, building a new product
• Started a new session: the navigations performed to build the working set of files
• Sometimes needed more time to determine if the predicted change of status was correct

LABEL AGGREGATION TECHNIQUE
[Figure: a run of 250 events divided into segments of 50 events, with features computed for each segment]
• Two techniques account for the fact that developers’ status does not change instantaneously
To give a more detailed explanation of how this works

IMPLEMENTATION OF DEVELOPER NOTIFICATIONS
• There are 5 status predictions when reporting a dominant status every 250 events
[Figure: sequence of per-segment predictions and the reported status; labels include Indeterminate, Slow Progress, and Making Progress]
• Allowed the developer to correct a predicted status to indeterminate

RESULTS OF PILOT STUDY
• There was a total of 88 predictions made
• Every hour we switched models without interrupting the user

Model                                  # of predictions    # of times user corrected status    Accuracy
Decision Tree                          32                  3                                   90%
Classification via Clustering Model    56                  10                                  82%

Status             # of times guessed    # of times corrected    Accuracy
Slow Progress      17                    2                       88%
Making Progress    69                    7                       89%
Indeterminate      2                     0                       100%

• Accuracy of this study is good
• Large number of false negatives
How do we improve our accuracy?
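The label aggregation idea above (compute features for each 50-event segment, predict a status per segment, and report the dominant status every 250 events) can be sketched as below. This is a minimal illustration under stated assumptions: the category labels, the function names, and the `classify` callback are hypothetical stand-ins, and the toy rule in the usage example is not the trained decision tree or clustering model.

```python
from collections import Counter

CATEGORIES = ("edit", "debug", "focus", "navigation", "remove")

def category_percentages(segment):
    """Percentage of the segment's commands that fall in each category."""
    counts = Counter(segment)
    total = sum(counts[c] for c in CATEGORIES)
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: 100.0 * counts[c] / total for c in CATEGORIES}

def chunks(events, size=50):
    """Split the event list into consecutive, non-overlapping segments."""
    return [events[i:i + size] for i in range(0, len(events), size)]

def dominant_status(predictions):
    """Most frequent prediction; ties go to the status seen first."""
    return Counter(predictions).most_common(1)[0][0]

def aggregate(events, classify, segment_size=50):
    """Predict one status per segment, then report the dominant one."""
    predictions = [classify(category_percentages(s))
                   for s in chunks(events, segment_size)]
    return dominant_status(predictions)

if __name__ == "__main__":
    # Toy rule standing in for a trained model (an assumption for this demo).
    def toy_rule(features):
        return "slow progress" if features["edit"] < 20 else "making progress"

    events = ["navigation"] * 150 + ["edit"] * 100  # 250 events, 5 segments
    print(aggregate(events, toy_rule))
```

Reporting only the dominant status smooths out isolated per-segment predictions, which matches the observation that a developer's status does not change instantaneously.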
COST OF PROCESSING INCREMENTAL INPUT EVENTS LESSON
[Figure labels: Advisor; noticeable; intolerable; 3-year-old laptop]

CHANGES IN THE TOOL INCREASED PROGRAMMING TIME AND EFFORT
[Figure: algorithms for the Decision Tree Model and the Classification via Clustering Model]
• Do not share code

SOLUTION TO CREATING NEW CODE FOR EACH PROGRAMMING ENVIRONMENT
• Build an architecture that is independent of the programming environment
[Figure: the Decision Tree Algorithm and the Classification via Clustering Algorithm plug into the architecture]
• Supports interoperability
• Also put process on server

ARCHITECTURE
• Made up of several modules

EVENT-INTERCEPTION MODULE
[Figure: developers’ actions as WOX XML; developers’ actions as a serialized object]
This module does several things:
• Captures events from both Eclipse and Visual Studio
• Passes these events to the prediction modules
• Prediction modules are written in C# so events from Visual Studio could be passed directly
• Java events were converted to C# using standard libraries

PREDICTION MODULES
[Figure: a Mediator connecting WOX, IKVM, Event Aggregator, Feature Extractor, Prediction Manager, Status Aggregator, and Previous Model]
• Mediator allows modules to be loosely coupled
• We can use several algorithms for event aggregation

DISCRETE CHUNKS/SLIDING WINDOW AGGREGATION ALGORITHM
[Figure: a discrete chunk of 3 events; a sliding window with window size = 3]
Can this tool work with professional programmers?

CONTROLLED USER STUDY
• 14 developers: 9 student programmers, 5 industry programmers
• Having difficulty is rare
• Make sure developers face difficulty during the study
• Tasks are not impossible to solve
• We use ACM programming problems

ACM PROGRAMMING PROBLEMS
Mid-Atlantic ACM Programming Contest, http://midatl.radford.edu/

Year    Problem Title                     # of Teams that Solved Problem    # of Teams    Percentage Correct
2006    Shrew-ology                       43                                138           31.16%
2004    Balanced Budget Initiative        23                                142           16.20%
2002    A Simple Question of Chemistry    124                               124           100%

Is self reporting reliable?

PROBLEM WITH SELF REPORTING
Previous research:
[Speech bubble: “I look good!!!”]
• People are over-confident in evaluating their own situations
[Speech bubble: “I don’t think so”]
• Judgments by observers are more reliable than self-reports, possibly due to people's perception of themselves
Shrauger, J. S. and Osberg, T. M. (1981). The Relative Accuracy of Self-Predictions and Judgments by Others in Psychological Assessment. Psychological Bulletin 90(2). 322-351.

EXPERIENCES WITH SELF REPORTS
• Developers had a hard time indicating they were having difficulty
• ‘Having difficulty’ status hurt developers' egos
• Problem addressed in our CSCW Video
• Hurt ego could cause developers to misrepresent their status
• Coders observe developers’ programming activities
• Independent determination of their status

OBSERVE DEVELOPERS
[Figure: a coder observing a developer]
• Record programming sessions and give them to two coders
• Recorded 40 hours and 44 minutes of video
How do we relieve coders from having to focus on hours of video?
Tang, J. C., Liu, S. B., Muller, M., Lin, J., and Drews, C. 2006. Unobtrusive but invasive: using screen recording to collect field data on computer-mediated interaction. In Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work (Banff, Alberta, Canada, November 04 - 08, 2006). CSCW '06. ACM, New York, NY, 479-482. DOI= http://doi.acm.org/10.1145/1180875.1180948

RELIEVING CODERS
• Coders would have to watch over 40 hours of video!
• Created a tool that marks “points of interest” or segments for coders to view
• Stuck segments came from: the tool, the participants, and me (while observing the participants writing code)
• Segments: few stuck segments, so coders classified all of these; many making progress segments
[Figure label: No Status]
How did we choose progress segments?
RELIEVING CODERS
[Figure labels: Slow Progress; Progress]
• Coders were not aware of the status of each segment
• Randomly sampled the making progress segments
• Made the number of random segments the same as the number of having difficulty segments
• Each segment was two minutes of video
• Two coders and the first author classified 26 stuck segments and 36 random segments

STUDY RESULTS
How much do the coders agree with each other's observations?

Segment Type       # of Agreements    # of Observations    % Agreement
Stuck segments     23                 26                   88%
Random segments    30                 36                   83%
Total              53                 62                   85%

• Coders agreed that 50% of making progress segments were having difficulty
• In 3 segments, coders were quicker than the tool
• In the remaining 15 segments, coders considered the inactivity of developers as being stuck
What about the stuck segments?

STUCK SEGMENTS
Total of 26 stuck segments
Coders’ agreement with: the tool, me, the participants

Entity          # of Agreements    # of Observations    % Agreement
Tool            13                 13                   100%
First Author    7                  9                    77%
Participant     3                  4                    75%
Total           23                 26                   88%

CODERS’ AGREEMENT WITH THE TOOL

Entity    # of Agreements    # of Observations    % Agreement
Tool      13                 13                   100%

• Participants never corrected the tool when having difficulty was guessed
• Coders agreed with the tool 100% of the time when participants were stuck
What about the coders’ agreement with me?

CODERS’ AGREEMENT WITH ME

Entity          # of Agreements    # of Observations    % Agreement
First Author    7                  9                    77%

• In 9 segments, I classified the making progress prediction of the tool as having difficulty
• Coders agreed with seven of these observations, or 77%
What about the coders’ agreement with the participants?
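The agreement percentages in the study results table are simple ratios of agreements to observations. The sketch below recomputes the per-segment-type rows and the total; the function and variable names are illustrative only, and rounding to the nearest whole percent is an assumption about how the slide figures were produced.

```python
def percent_agreement(agreements, observations):
    """Agreements as a whole-number percentage of observations."""
    return round(100 * agreements / observations)

# (agreements, observations) per segment type, taken from the table above.
rows = {
    "Stuck segments": (23, 26),
    "Random segments": (30, 36),
}

# Column-wise sums give the "Total" row: (53, 62).
total = tuple(sum(column) for column in zip(*rows.values()))

if __name__ == "__main__":
    for name, (agreed, observed) in rows.items():
        print(name, percent_agreement(agreed, observed))
    print("Total", percent_agreement(*total))
```

Percent agreement does not correct for agreement expected by chance, so it is best read as a descriptive summary of how often the coders matched.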
CODERS’ AGREEMENT WITH PARTICIPANTS

Entity         # of Agreements    # of Observations    % Agreement
Participant    3                  4                    75%

• Coders agreed with the participant 75% of the time
We can draw several preliminary conclusions from this data

PRELIMINARY CONCLUSIONS
We can look at this data from three perspectives
Tool:
• When the tool predicts having difficulty, all entities agree with it 100%
• The tool does not miss times when programmers are having difficulty
We can also look at this data from the participants’ perspective

PARTICIPANTS’ PERSPECTIVE
• 9 student programmers, 5 industry programmers
• Participants only corrected the tool 4 out of 1222 times in the study
• Using this as ground truth, the tool is correct 99.6% of the time
We can look at this data from the coders’ perspective

CODERS’ PERSPECTIVE
• According to coders, our tool missed half of the times when programmers were stuck
Whose perspective should we trust?
Let’s consider both the participants and coders

PARTICIPANTS’ PERSPECTIVE AS GROUND TRUTH
Why participants?
• Coders could not read the mind of participants
• Asked participants about the accuracy of the tool
"I think it worked pretty well; It's non-intrusive, and only pops up with information when the status changes."
"It knew when I was having issues cause it switched to slow progress and when I was flyin doing all the class design it said progress."

Survey Question                             Mean    Median    STDDEV
Q1  I felt that the tool was accurate.      6       6         .95

What about coders?

CODERS’ PERSPECTIVE AS GROUND TRUTH
Why coders?
• Participants tend to underreport their problems
• If we use coders, the tool would have missed a lot of the times developers had difficulty
Why did the tool miss when developers had difficulty?
EXPLANATIONS FOR WHY THE TOOL WAS INACCURATE
• The training data used to predict developers’ status came from six student programmers
• Student programmers worked on research and class assignments
• Behavior of the two groups is different
Does this inaccuracy mean the tool is not useful?

THE TOOL IS USEFUL
• Missing some having difficulty statuses is no worse than the current practice
• If it is considered desirable to automatically let others know about developers’ difficulties, it seems better to use our tool than not use it
What about privacy concerns?

USERS WE TRUST
What groups of users can be trusted?
• Student
• Mentor/Advisor
• Friends
There are also other users who may judge a developer

PRIVACY CONCERNS: JUDGERS
• Use information about a programmer being stuck repeatedly in a negative manner
• Cause people to lose respect in their team
• Programmers may want to investigate the problem more
• They may not want to share the part of the code with which they are having difficulty
To solve this problem, we used a feature already available in Google Talk

BLOCK USERS
• We allow users to block judgers
• Problem: Blocked users cannot see the status of the blocker and vice versa
• Problem: If you block a manager or your advisor, the person will know that you blocked them
To overcome this problem, we created two panels in the training user interface

A SOLUTION TO PRIVACY CONCERNS
[Figure: user interface with a Buddy List, an Actual Status Pane, a Reported Status Pane, and a Countdown]

SUMMARY
• It is useful and possible to automatically identify that distributed programmers are stuck
• Described our approach to solve this problem and presented a useful set of features
  X = # of edits + # of debugs + # of focus events + # of navigations + # of remove events
  Edit % = (# of edits / X) * 100
  Debug % = (# of debugs / X) * 100
  Focus % = (# of focus events / X) * 100
  Navigation % = (# of navigations / X) * 100
  Remove % = (# of remove events / X) * 100
• We built this approach into two programming environments
[Figure: developers’ actions as WOX XML; developers’ actions as a serialized object]
• Validated our approach through user studies

FUTURE WORK: APPLICATIONS
• Tool that can determine when programmers are stuck (~90% accurate)
• Distributed Programming Lab
• Predict time spent and task difficulty
• Expertise
How does being stuck affect a developer’s day?

FUTURE WORK: TIME SPENT
[Figure: chart with axis label “Hours”]

FUTURE WORK: PREDICTING TIME SPENT

FUTURE WORK: EXPERTISE
• The lower the percentage, the harder the problem
• More difficult problem, more times users were stuck
• Easier problem, fewer times users were stuck
• Difficulty level is determined by the percentage of ACM teams that got the problem correct

THANK YOU!