A SOFTWARE ENGINEERING TOOL
FOR DISTRIBUTED DEVELOPMENT
Jason Carter
Prasun Dewan
University of North Carolina at Chapel Hill
MOTIVATION
Programmer Bob: "Grrr…"
Programmer's Mentor/Teammate: "Hmm… is Bob stuck?"
APPLICATIONS
"Need help?" (a manager or mentor offering help to a student)
Offer help to student programmers who are too shy to ask for it.
Significantly improve programmer productivity.
Benefits of this idea may occur in industry as well.
CO-LOCATED VS. DISTRIBUTED
Co-located teams are more productive; distributed teams are less productive [2].
[2] Herbsleb, J.D., et al. Distance, Dependencies, and Delay in a Global Collaboration. In Proc. CSCW 2000.
PRODUCTIVITY HIGHER IN WAR-ROOMS THAN IN CUBICLES
War-room teams were more productive than cubicle teams.
Combined, these studies show that distance hurts productivity.
Teasley, S., et al. How Does Radical Collocation Help a Team Succeed? In Proc. CSCW 2000.
DISTANCE IMPEDES DEDUCTION
Bob: "Hmm… is Alice stuck?"  Alice: "Grrr…"
Distance separates Bob and Alice, and developers often do not explicitly ask for help.
How do we reduce this gap?
COLLABVS
Developers are aware of the methods their distributed teammates are working on.
They use this information, together with project information, to manually determine if a teammate is stuck.
CollabVS gives distributed users the feeling of "being there" in a single location.
Hegde, R. and Dewan, P. Connecting Programming Environments to Support Ad-Hoc Collaboration. In Proc. 23rd IEEE/ACM International Conference on ASE, 2008.
CAN WE DO BETTER THAN BEING THERE?
Face-to-Face Interaction: "Being There" (Bob and Alice in one place).
"Beyond Being There": Bob and Alice apart, with support that co-location does not give.
How do we plan to go "beyond being there"?
Hollan, J. and Stornetta, S. Beyond Being There. In Proc. CHI '92.
“BEYOND BEING THERE”
To Programmer Bob: "You are having difficulty."
To Bob's mentor/teammate: "Bob is having difficulty."
There are several ways to infer this information.
AUTOMATIC PREDICTION OF FRUSTRATION
Kapoor et al. predicted frustration with posture-sensing chairs, wireless Bluetooth skin-conductance sensors, pressure mice, and a video camera.
Problem: the overhead of using this non-standard equipment.
Alternative approach: determine this information by logging interaction with some component of the system.
Kapoor, A., Burleson, W., and Picard, R.W. Automatic Prediction of Frustration. International Journal of Human-Computer Studies, Vol. 65, Issue 8, 2007.
DETERMINE IF PROGRAMMERS ARE INTERRUPTIBLE
Fogarty et al. developed a tool that uses developers' actions to determine if they are interruptible (interruptibility rated on a 0-100 scale). The tool randomly interrupted developers.
Can we use this approach?
Fogarty, J., Ko, A.J., Aung, H.H., Golden, E., Tang, K.P. and Hudson, S.E. Examining Task Engagement in Sensor-Based Statistical Models of Human Interruptibility. In Proc. CHI 2005, ACM Press (2005), 331-340.
INFORMATION ABOUT EVENTS
(Interruptibility rated on a 0-100 scale.)
A random interruption is unlikely to land exactly when a developer is having difficulty.
A better alternative is to allow developers to report their status.
USE BUTTONS TO COLLECT INFORMATION
Buttons used to indicate status:
"Eureka Button": captures situations in which developers did not realize they had been having a problem until they had solved it.
"Notifications Enabled": allowed developers to determine whether they received status-change notifications.
These buttons are useful only for the training phase, where it is useful to run an initial naïve algorithm.
OUR APPROACH
Use Buttons to Gather Data → Determine a Measure of Progress Using a Naive Algorithm → Derive Mining Algorithm → Validate Our Algorithm → Build a Model
BASIC INTUITION
Tool to developer: "You are having difficulty."
Monitor the progress of developers on a 0-100 scale; when progress falls below some threshold, indicate that they are having difficulty, as sketched below.
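A minimal Python sketch of this intuition (the threshold value is an assumption; the talk does not give one):

    # Assumed threshold on a 0-100 progress scale; the talk gives no value.
    THRESHOLD = 30

    def status(progress: float, threshold: float = THRESHOLD) -> str:
        """Map a 0-100 progress measure to a coarse status."""
        return "having difficulty" if progress < threshold else "making progress"

    print(status(12))   # having difficulty
    print(status(85))   # making progress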
Productivity and progress are related but fundamentally different.
RELATIONSHIP BETWEEN PRODUCTIVITY AND PROGRESS
Productivity is usually measured after developers have written code; progress is measured while developers write code.
Little work has been done on measuring progress; the only work we could find was done by Kersten and Murphy.
MYLAR: TOOL TO REDUCE NAVIGATION
Edit Ratio = (# of Edit Commands) / (# of Navigation Commands)
Kersten, M. and Murphy, G.C. Mylar: A Degree-of-Interest Model for IDEs. In Proc. Conference on Aspect-Oriented Software Development, 2005, 159-168.
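A small Python sketch of the edit-ratio computation (the function name is ours, not Mylar's):

    def edit_ratio(num_edits: int, num_navigations: int) -> float:
        """Mylar's edit ratio: edit commands per navigation command."""
        if num_navigations == 0:
            return float("inf")   # an all-edit session has no navigation
        return num_edits / num_navigations

    print(edit_ratio(40, 160))   # 0.25: a navigation-heavy session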
OUR APPROACH: DETERMINE A MEASURE OF PROGRESS
Use Buttons to Gather Data → [Determine a Measure of Progress Using a Naive Algorithm] → Derive Mining Algorithm → Validate Our Algorithm → Build a Model
METRICS TO MEASURE PROGRESS
Metrics: # of Debug Commands and Edit Ratio, with a low threshold.

Participants          Tasks
3 freshmen            Class assignments
3 graduate students   Class and research assignments

Progress statuses: Stuck and Having Difficulty.
The naïve algorithm did not predict the progress status well, so we explored the logs and the developers' corrections to derive a better algorithm.
Y. Sharon. Eclipseye: Spying on Eclipse. Bachelor's thesis, University of Lugano, June 2007.
OUR APPROACH: DERIVE MINING ALGORITHM
Use Buttons to Gather Data → Determine a Measure of Progress Using a Naive Algorithm → [Derive Mining Algorithm] → Validate Our Algorithm → Build a Model
DERIVING MINING ALGORITHM
Analyze logs to find patterns at the times developers indicated they were having difficulty.
Features: values that change between making progress and having difficulty.
A manual inspection of the logs showed that the frequency of certain edit commands decreased when developers were having difficulty.
COMMANDS GROUPED INTO FIVE CATEGORIES
Depending on the developer, the frequency of execution of other commands increased. Commands were grouped into five categories:
Navigation: switch between files, search for code.
Edit: edit program code (insertion/deletion).
Remove: remove a class or method.
Debug: debug or execute code.
Focus: Eclipse loses or gains focus.
We used these categories to create our features.
IDENTIFYING FEATURES
For each segment of the log and each command category we calculated:
feature = (occurrences of the category in the segment / total # of commands in the segment) × 100
We used these percentages as features to identify patterns; a minimal sketch follows. The size of these segments is an important issue.
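A minimal Python sketch of the feature computation, assuming each logged command has already been mapped to one of the five categories (all names here are ours):

    from collections import Counter

    CATEGORIES = ("navigation", "edit", "remove", "debug", "focus")

    def segment_features(segment):
        """Percentage of commands in each category for one log segment."""
        counts = Counter(segment)
        total = len(segment)
        return {c: 100.0 * counts[c] / total for c in CATEGORIES}

    segment = ["edit", "edit", "navigation", "debug",
               "edit", "focus", "navigation", "edit"]
    print(segment_features(segment))
    # {'navigation': 25.0, 'edit': 50.0, 'remove': 0.0, 'debug': 12.5, 'focus': 12.5}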
DETERMINING SEGMENT SIZE
<action>
  <eventType>SOLUTION_OPENED</eventType>
  <solutionEvent>
    <timestamp>9/20/2009 12:44:02 PM</timestamp>
  </solutionEvent>
</action>
<action>
  <commandEvent>
    <command>Debug.Start</command>
    <timestamp>9/20/2009 12:45:33 PM</timestamp>
  </commandEvent>
</action>
<action>
  <eventType>WINDOW_LOST_FOCUS</eventType>
  <windowEvent>
    <timestamp>12:46:01 PM</timestamp>
  </windowEvent>
</action>
<action>
  <eventType>WINDOW_GAINED_FOCUS</eventType>
  <windowEvent>
    <timestamp>12:48:01 PM</timestamp>
  </windowEvent>
</action>
Segment sizes considered: the whole log, 200, 100, 50, and 25 events.
We graphed the programming behavior of all participants to determine the usefulness of the features; a segmentation sketch follows.
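One way to compare the candidate segment sizes, sketched in Python (the helper name is ours):

    def split_into_segments(events, size):
        """Split an event log into consecutive fixed-size segments."""
        return [events[i:i + size] for i in range(0, len(events), size)]

    log = ["edit"] * 120 + ["debug"] * 30   # toy log of 150 categorized events
    for size in (25, 50, 100, 200):         # candidate sizes from the slide
        print(size, "->", len(split_into_segments(log, size)), "segments")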
OUR APPROACH: VALIDATE ALGORITHM
Use Buttons to Gather Data → Determine a Measure of Progress Using a Naive Algorithm → Derive Mining Algorithm → [Validate Our Algorithm] → Build a Model
GRAPHS TO VALIDATE FEATURES
GRAPHS TO VALIDATE FEATURES (CONT.)
GRAPHS TO VALIDATE FEATURES (CONT.)
The two graphs validate our feature choice, and they show that a general model must account for differences in which percentages change when developers are having difficulty.
There are several standard ways to build a general model.
NUMBER OF STUCK EVENTS SIGNIFICANTLY LESS THAN TOTAL NUMBER OF EVENTS
Breakdown of events: 2212 Making Progress events and 76 Having Difficulty events, out of 2288 total events.
This leads to the imbalanced class distribution problem.
IMBALANCED CLASS DISTRIBUTION
A needle in a haystack: there is a disproportionate number of having-difficulty segments relative to making-progress segments, so "standard" algorithms learn to predict making progress ~97% of the time.

Actual Status       Predicted Status
Making Progress     Making Progress
Making Progress     Making Progress
Making Progress     Making Progress
Making Progress     Making Progress
Making Progress     Making Progress
Having Difficulty   Making Progress

Accuracy of this model: 5/6 ≈ 83%.
Problem: the model can't identify when a developer is having difficulty.
SMOTE ALGORITHM
SMOTE replicates the rare class (having difficulty) until there is more of a balance between having-difficulty and making-progress statuses.
The rebalanced data of all developers were combined and used as input to several standard algorithms to build a model.
Making Progress: 2212 segments. Having Difficulty: 76 segments, increased to 1216 by SMOTE.
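A hedged sketch of the rebalancing step, using the imbalanced-learn library's SMOTE as a stand-in (the talk does not name an implementation; strictly, SMOTE synthesizes interpolated minority examples rather than literally duplicating records):

    import numpy as np
    from imblearn.over_sampling import SMOTE

    X = np.random.rand(2288, 5)            # five category-percentage features
    y = np.array([0] * 2212 + [1] * 76)    # 1 = having difficulty (rare class)

    # A sampling_strategy of ~0.55 reproduces the slide's 1216-vs-2212 balance.
    X_res, y_res = SMOTE(sampling_strategy=0.55, random_state=0).fit_resample(X, y)
    print(np.bincount(y_res))              # approximately [2212, 1216]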
BUILD A MODEL
Applied mining algorithms to the logs (with replicated data) of Participants 1-6 using standard 10-fold cross-validation to build a model:
10 trials of model construction were executed.
Each trial used 90% of the data for training.
The remaining 10% was used as test data to evaluate the model in that trial.
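The cross-validation step might look like this in scikit-learn (an illustrative assumption; the cited Witten and Frank toolkit suggests WEKA-style tools were actually used):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Stand-in data; in the study this is the rebalanced, pooled feature set.
    X = np.random.rand(3428, 5)
    y = np.array([0] * 2212 + [1] * 1216)

    # 10 trials: each trains on 90% of the data and tests on the held-out 10%.
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    print(scores.mean())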
ACCURACY OF MODEL USING DECISION TREE ALGORITHM
Decision Tree Model:

                           Predicted Having Difficulty   Predicted Making Progress
Actual Having Difficulty   1101                          115
Actual Making Progress     158                           2130

Overall accuracy: 92%. Having-difficulty segments correctly predicted: 90%.
Developers were having difficulty 1216 times and making progress 2288 times.
Witten, I.H. and Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
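The slide's summary numbers follow directly from the confusion matrix:

    tp, fn = 1101, 115   # actual having difficulty: predicted HD / predicted MP
    fp, tn = 158, 2130   # actual making progress:  predicted HD / predicted MP

    overall_accuracy = (tp + tn) / (tp + fn + fp + tn)   # ~0.92
    having_difficulty_recall = tp / (tp + fn)            # ~0.90
    print(round(overall_accuracy, 3), round(having_difficulty_recall, 3))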
CLASSIFICATION VIA CLUSTERING
Classification via clustering is designed to identify rare events without replicating records.
Classification via Clustering Model:

                           Predicted Having Difficulty   Predicted Making Progress
Actual Having Difficulty   60                            16
Actual Making Progress     896                           1392

Overall accuracy: 67%. Having-difficulty segments correctly predicted: 79%.
Developers were having difficulty 76 times and making progress 2288 times.
Witten, I.H. and Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
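A minimal sketch of the classification-via-clustering idea: cluster the feature vectors without their labels, then label each cluster with the majority class of its training points (scikit-learn KMeans is our stand-in for the toolkit actually used):

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_cluster_classifier(X, y, n_clusters=2, seed=0):
        km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(X)
        # Majority class among the training points assigned to each cluster.
        mapping = {c: np.bincount(y[km.labels_ == c]).argmax()
                   for c in range(n_clusters)}
        return km, mapping

    def predict(km, mapping, X):
        return np.array([mapping[c] for c in km.predict(X)])

    X = np.random.rand(2288, 5)
    y = np.array([0] * 2212 + [1] * 76)    # no replication of the rare class
    km, mapping = fit_cluster_classifier(X, y)
    print(predict(km, mapping, X[:5]))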
OUR APPROACH
Use Buttons to Gather Data → Determine a Measure of Progress Using a Naive Algorithm → Derive Mining Algorithm → Validate Our Algorithm → Build a Model
The approach is promising but left several unanswered questions.
DOES APPROACH WORK IN PRACTICE?
The Decision Tree Model and the Classification via Clustering Model both use five features: Edit Percentage, Debug Percentage, Focus Percentage, Navigation Percentage, and Remove Percentage.
Our research group and one industrial developer used the software, and we learned several important lessons.
HAVING DIFFICULTY LESSON
The Stuck button and the "having difficulty" status hurt my advisor's ego.
FREQUENT FALSE POSITIVES LESSON
An industry developer building a new product on a workflow system started a new session; the navigations performed to build the working set of files triggered false positives.
Sometimes more time was needed to determine if a predicted change of status was correct.
LABEL AGGREGATION TECHNIQUE
Features are computed over consecutive 50-event segments, and a dominant status is reported once every 250 events (five segments).
Two techniques account for the fact that developers' status does not change instantaneously; a sketch and a more detailed explanation follow.
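A Python sketch of the dominant-status report (names ours): a prediction is made per 50-event segment, and the most common of the last five predictions is reported every 250 events.

    from collections import Counter

    SEGMENT_SIZE = 50            # events per feature segment (from the slide)
    PREDICTIONS_PER_REPORT = 5   # 5 x 50 = 250 events per reported status

    def dominant_status(predictions):
        """Report the most common predicted status over the last 250 events."""
        return Counter(predictions).most_common(1)[0][0]

    window = ["slow progress", "slow progress", "making progress",
              "slow progress", "making progress"]
    print(dominant_status(window))   # slow progress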
IMPLEMENTATION OF DEVELOPER NOTIFICATIONS
There are 5 status predictions behind each dominant status reported every 250 events. For example:
Status: Slow Progress    | Prediction: Indeterminate
Status: Slow Progress    | Prediction: Slow Progress
Status: Slow Progress    | Prediction: Slow Progress
Status: Slow Progress    | Prediction: Slow Progress
Status: Making Progress  | Prediction: Making Progress
We allowed the developer to correct a predicted status to indeterminate.
RESULTS OF PILOT STUDY
A total of 88 predictions were made. Every hour we switched models without interrupting the user.

Model                           # of predictions   # of times user corrected status   Accuracy
Decision Tree                   32                 3                                  90%
Classification via Clustering   56                 10                                 82%

Status            # of times guessed   # of times corrected   Accuracy
Slow Progress     17                   2                      88%
Making Progress   69                   7                      89%
Indeterminate     2                    0                      100%

The accuracy in this study is good, but there was a large number of false negatives.
How do we improve our accuracy?
COST OF PROCESSING INCREMENTAL INPUT EVENTS LESSON
On my advisor's 3-year-old laptop, the cost of processing incremental input events was noticeable to the point of being intolerable.
CHANGES IN THE TOOL INCREASED PROGRAMMING TIME AND EFFORT
The Decision Tree Model and the Classification via Clustering Model do not share code, so each change had to be made in both algorithms.
SOLUTION TO CREATING NEW CODE FOR EACH PROGRAMMING ENVIRONMENT
Build an architecture, independent of the programming environment, that supports both the Decision Tree algorithm and the Classification via Clustering algorithm.
The architecture supports interoperability; the prediction process also runs on a server.
ARCHITECTURE
The architecture is made up of several modules.
EVENT-INTERCEPTION MODULE
(Diagram: developers' actions arrive from Eclipse as WOX XML and from Visual Studio as serialized objects.)
This module does several things:
Captures events from both Eclipse and Visual Studio.
Passes these events to the prediction modules.
The prediction modules are written in C#, so events from Visual Studio could be passed directly; Java events were converted to C# using standard libraries.
PREDICTION MODULES
The modules: Event Aggregator, Feature Extractor, Prediction Manager, and Status Aggregator, connected through WOX and IKVM, together with the previously built model.
A Mediator allows the modules to be loosely coupled.
We can use several algorithms for event aggregation.
DISCRETE CHUNKS/SLIDING WINDOW AGGREGATION ALGORITHM
Discrete chunks: consecutive groups of 3 events. Sliding window: window size 3, advanced one event at a time. A sketch of the two strategies follows.
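A Python sketch contrasting the two aggregation strategies (names ours):

    def discrete_chunks(events, size=3):
        """Non-overlapping chunks: each event is aggregated exactly once."""
        return [events[i:i + size] for i in range(0, len(events), size)]

    def sliding_windows(events, size=3):
        """Overlapping windows: a new aggregate after every incoming event."""
        return [events[i:i + size] for i in range(len(events) - size + 1)]

    events = ["e1", "e2", "e3", "e4", "e5"]
    print(discrete_chunks(events))    # [[e1,e2,e3], [e4,e5]]
    print(sliding_windows(events))    # [[e1,e2,e3], [e2,e3,e4], [e3,e4,e5]]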
Can this tool work with professional programmers?
CONTROLLED USER STUDY
14 developers: 9 student programmers and 5 industry programmers.
Because having difficulty is rare, we had to make sure developers faced difficulty during the study, while keeping the tasks possible to solve.
We used ACM programming problems.
ACM PROGRAMMING PROBLEMS
Problems were taken from the Mid-Atlantic ACM Programming Contest (http://midatl.radford.edu/):

Year   Problem Title                    # of Teams that Solved Problem   # of Teams   Percentage Correct
2006   Shrew-ology                      43                               138          31.16%
2004   Balanced Budget Initiative       23                               142          16.20%
2002   A Simple Question of Chemistry   124                              124          100%

Is self-reporting reliable?
PROBLEM WITH SELF REPORTING
Previous research shows that people are over-confident in evaluating their own situations ("I look good!!!" vs. "I don't think so") and that judgments by observers are more reliable than self-reports, possibly due to people's perception of themselves.
Shrauger, J.S. and Osberg, T.M. (1981). The Relative Accuracy of Self-Predictions and Judgments by Others in Psychological Assessment. Psychological Bulletin 90(2), 322-351.
EXPERIENCES WITH SELF REPORTS
Developers had a hard time indicating they were having difficulty: the 'having difficulty' status hurt developers' egos (a problem addressed in our CSCW video), and a hurt ego could cause developers to misrepresent their status.
Therefore, coders observed developers' programming activities to make an independent determination of their status.
OBSERVE DEVELOPERS
We recorded programming sessions and gave them to two coders: 40 hours and 44 minutes of video in total.
How do we relieve coders from having to focus on hours of video?
Tang, J.C., Liu, S.B., Muller, M., Lin, J., and Drews, C. Unobtrusive but Invasive: Using Screen Recording to Collect Field Data on Computer-Mediated Interaction. In Proc. CSCW '06, ACM, 479-482. DOI: http://doi.acm.org/10.1145/1180875.1180948
RELIEVING CODERS
We created a tool that marks "points of interest", i.e., segments, for coders to view.
Segment statuses came from the tool, the participants, and me (while observing the participants writing code).
There were few stuck segments, so coders classified all of these. There were many making-progress segments, and coders would have to watch over 40 hours of video to see them all.
How did we choose the progress segments?
RELIEVING CODERS
We randomly sampled the making-progress segments, making the number of random segments the same as the number of having-difficulty segments. Each segment was two minutes of video, and coders were not aware of the status of each segment.
Two coders and the first author classified 26 stuck segments and 36 random segments.
STUDY RESULTS
How much do the coders agree with each other's observations?

Segment Type      # of Agreements   # of Observations   % Agreement
Stuck segments    23                26                  88%
Random segments   30                36                  83%
Total             53                62                  85%

Coders agreed that 50% of the making-progress segments were actually having-difficulty segments: in 3 segments the coders were quicker than the tool, and in the remaining 15 segments the coders considered the developers' inactivity as being stuck.
What about the stuck segments?
STUCK SEGMENTS
Coders' agreement with the tool, with me (the first author), and with the participants, over a total of 26 stuck segments:

Entity         # of Agreements   # of Observations   % Agreement
Tool           13                13                  100%
First Author   7                 9                   77%
Participant    3                 4                   75%
Total          23                26                  88%
CODERS AGREEMENT WITH THE TOOL
Entity   # of Agreements   # of Observations   % Agreement
Tool     13                13                  100%

Participants never corrected the tool when it guessed having difficulty, and coders agreed with the tool 100% of the time when participants were stuck.
What about the coders' agreement with me?
CODERS AGREEMENT WITH ME
Entity         # of Agreements   # of Observations   % Agreement
First Author   7                 9                   77%

In 9 segments, I classified the tool's making-progress prediction as having difficulty; coders agreed with seven of these observations (77%).
What about the coders' agreement with the participants?
CODERS AGREEMENT WITH PARTICIPANTS
Entity        # of Agreements   # of Observations   % Agreement
Participant   3                 4                   75%

Coders agreed with the participants 75% of the time.
We can draw several preliminary conclusions from this data.
PRELIMINARY CONCLUSIONS
We can look at this data from three perspectives.
Tool: when the tool predicts having difficulty, all entities agree with it 100% of the time; from this perspective, the tool does not miss times when programmers are having difficulty.
We can also look at this data from the participants' perspective.
PARTICIPANTS’ PERSPECTIVE
The participants (9 student programmers and 5 industry programmers) only corrected the tool 4 out of 1222 times in the study.
Using this as ground truth, the tool is correct 99.6% of the time.
We can also look at this data from the coders' perspective.
CODERS’ PERSPECTIVE
According to the coders, our tool missed half of the times when programmers were stuck.
Whose perspective should we trust? Let's consider both the participants' and the coders'.
PARTICIPANTS’ PERSPECTIVE AS GROUND TRUTH
Why participants? Coders could not read the minds of the participants.
We asked participants about the accuracy of the tool:
"I think it worked pretty well; It's non-intrusive, and only pops up with information when the status changes."
"It knew when I was having issues cause it switched to slow progress and when I was flyin doing all the class design it said progress."

Survey Question                         Mean   Median   STDDEV
Q1: I felt that the tool was accurate.  6      6        .95

What about coders?
CODERS’ PERSPECTIVE AS GROUND TRUTH
Why coders? Participants tend to underreport their problems.
If we use the coders as ground truth, the tool missed many of the times developers had difficulty.
Why did the tool miss when developers had difficulty?
EXPLANATIONS FOR WHY THE TOOL WAS INACCURATE
The training data used to predict developers' status came from six student programmers who worked on research and class assignments, and the behavior of the two groups (students and industry developers) is different.
Does this inaccuracy mean the tool is not useful?
THE TOOL IS USEFUL
Missing some having-difficulty statuses is no worse than the current practice: if it is considered desirable to automatically let others know about developers' difficulties, it seems better to use our tool than not to use it.
What about privacy concerns?
USERS WE TRUST
What groups of users can be trusted? A student's mentor/advisor and friends, for example.
There are also other users who may judge a developer.
PRIVACY CONCERNS: JUDGERS
Judgers may use information about a programmer being stuck repeatedly in a negative manner, causing people to lose respect for the programmer in their team.
Programmers may also want to investigate the problem more themselves, and may not want to share the part of the code with which they are having difficulty.
To solve this problem, we used a feature already available in Google Talk.
BLOCK USERS
We allow users to block judgers.
Problem: blocked users cannot see the status of the blocker and vice versa, so you cannot quietly block a manager or your advisor, because the person will know that you blocked them.
To overcome this problem, we created two panes in the training user interface.
A SOLUTION TO PRIVACY CONCERNS
The interface contains an Actual Status pane and a Reported Status pane, alongside the buddy list and a countdown.
SUMMARY
It is useful and possible to automatically identify that distributed programmers are stuck.
We described our approach to solving this problem and presented a useful set of features:
X = # of edits + # of debugs + # of focus events + # of navigations + # of remove events
Edit % = (# of edits / X) × 100
Debug % = (# of debugs / X) × 100
Focus % = (# of focus events / X) × 100
Navigation % = (# of navigations / X) × 100
Remove % = (# of remove events / X) × 100
We built this approach into two programming environments, exchanging developers' actions as WOX XML and serialized objects, and validated it through user studies.
FUTURE WORK: APPLICATIONS
We have a tool that can determine when programmers are stuck with ~90% accuracy. Future applications:
Distributed programming lab.
Predicting time spent and task difficulty.
Estimating expertise.
How does being stuck affect a developer's day?
FUTURE WORK: TIME SPENT
(Graph of hours spent.)
FUTURE WORK: PREDICTING TIME SPENT
FUTURE WORK: EXPERTISE
The difficulty level is determined by the percentage of ACM teams that got the problem correct: the lower the percentage, the harder the problem.
The more difficult the problem, the more times users were stuck; the easier the problem, the fewer times users were stuck.
THANK YOU!