Recommendation Systems
for Code Reuse
Tao Xie
Department of Computer Science
North Carolina State University
Raleigh, USA
Motivation
• Programmers commonly reuse APIs of existing frameworks or libraries
– Advantages: Low cost and high efficiency of development
– Challenges: Complexity and lack of documentation
E.g., searching for information takes nearly ¼ of developer time [metallect.com]
Example Task from Eclipse Programming
Task: How to parse code in a dirty editor of Eclipse?
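[Figure: PARSEWeb overview. Method-invocation sequences (MIS 1 … MIS k) are extracted from open source projects 1 … N; frequent sequences (FMIS 1 … FMIS n) are mined from them and recommended in answer to the query.]
Query: “IEditorPart -> ICompilationUnit”
*MIS: Method-Invocation Sequence, FMIS: Frequent MIS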
PARSEWeb [Thummalapenta&Xie ASE 07]
Scenario 1
How to use these APIs?
• While reusing APIs of existing
open source frameworks or
libraries, programmers often
– know what type of object they need
– but do not know how to write code
for getting that object
Query: “Source Destination”
Prospector [Mandelin et al. PLDI 05 ], XSnippet [Sahavechaphan&Claypool OOPSLA 06 ],
PARSEWeb [Thummalapenta&Xie ASE 07]
Example Task from Eclipse Programming
• Task: How to parse code in a dirty editor?
• Query: IEditorPart -> ICompilationUnit
• Example solution from Prospector/PARSEWeb:
IEditorPart iep = ...
IEditorInput editorInp = iep.getEditorInput();
IWorkingCopyManager wcm = JavaUI.getWorkingCopyManager();
ICompilationUnit icu = wcm.getWorkingCopy(editorInp);
• Difficulties:
a. Needs an instance of IWorkingCopyManager
b. Needs to invoke a static method of JavaUI for getting the preceding
instance
Prospector [Mandelin et al. PLDI 05 ], XSnippet [Sahavechaphan&Claypool OOPSLA 06 ],
PARSEWeb [Thummalapenta&Xie ASE 07]
Scenario 2
How to use these APIs?
• While reusing APIs of existing open
source frameworks or libraries,
programmers often
– know what method call they need
– but do not know how to write code
before and after this method call
Query: “Method name”
MAPO [Xie&Pei MSR 06]
Example Task from BCEL Programming
• Task: How to instrument the bytecode of a Java class
by adding an extra method to the class?
• Query: org.apache.bcel.generic.ClassGen
public void addMethod(Method m)
• Example solution from MAPO:
public void generateStubMethod(ClassGen c) {
    InstructionList il = new InstructionList();
    MethodGen m = genFromISList(il);
    m.setMaxLocals();
    m.setMaxStack();
    c.addMethod(m.getMethod());
    System.out.println("…");
    …
}
MAPO [Xie&Pei MSR 06]
Scenario 3
How to use these APIs?
• While reusing APIs of existing open
source frameworks or libraries,
programmers often
– know structural context such as a
class’ type, its parents, and fields’
types, a method’s signature, method or
constructor callees
– but do not know how to write code in
this context
Query: Structural context
Strathcona [Holmes et al. 05], XSnippet [Sahavechaphan&Claypool OOPSLA 06 ]
Example Task from HttpClient Programming
• Task: How to evolve a system to use a third party
library, HttpClient, for handling http connections?
• Query: HttpClient, PostMethod classes
• Example solution from Strathcona:
Strathcona [Holmes et al. 05], XSnippet [Sahavechaphan&Claypool OOPSLA 06 ]
Steps in Recommenders
• Data collection/extraction
• Data preprocessing
• Data analysis/mining
• Result postprocessing
• Result representation
Data Collection/Extraction
• From one or multiple local code
repositories
– Often followed by offline analysis or mining
– Challenges: lack of relevant code examples
– Ex.: Strathcona, Prospector, XSnippet
• From the whole open source world with a
code search engine!
– Often followed by on-the-fly analysis and mining
– Challenges: only partial code files
– Ex.: MAPO, PARSEWeb
Exploiting A Code Search Engine
• Accepts queries including keywords of class and/or method names
• Interacts with a code search engine such as Google
code search to gather related code samples
• Stores gathered code samples (source files) in a local
code repository (later being analyzed and mined)
• Challenges: gathered code samples are partial and not
compilable as code search engines retrieve individual
source files instead of entire projects
PARSEWeb [Thummalapenta&Xie ASE 07]
Available Code Search Engines
• Google Code Search
http://www.google.com/codesearch
• Krugle: http://www.krugle.com/
• Koders: http://www.koders.com/
• Codase: http://www.codase.com/
• JExamples: http://www.jexamples.com/
etc.
Why not just use code search engines?
What are Developers Searching for?
• 15 million queries of Windows Live Search from May 2006
• 339 sessions related to Java programming
117 API sessions (34.2%); 70 trouble-shooting sessions (20.6%)
Assieme [Hoffmann et al. UIST 07]
API-related Search Sessions
• 64.1% sessions contained queries that were
merely descriptive but did not contain actual
names of APIs, packages, types, or members.
• The remaining sessions contained
– API or package names (12.8%),
– Type names (17.9%)
– Method names (5.1%).
• Among all these API-related sessions, 17.9%
contained terms like “example”, “using”, or
“sample code”
Assieme [Hoffmann et al. UIST 07]
An Example 4-Query Session
• java JSP current date
• java JSP currentdate
• java SimpleDateFormat
• using currentdate in jsp
Assieme [Hoffmann et al. UIST 07]
Why Not Use Web Search Engines?
Query: “parse xml java”
• Only compatible with new Java versions
• Requires installation of external library, but no link
• Code on pages essentially the same
• Contains no code examples
Assieme [Hoffmann et al. UIST 07]
©Raphael Hoffmann
Code Search Engines
• Index source code of open-source projects (from compressed archive files and CVS repositories)
• Code is parsed and terms in type names, variable names, etc. are weighted differently.
import javax.xml.parsers.*;
import org.w3c.dom.*;
public class JAXPSample {
public static void main(String[] args) {
String filename = "sample.xml";
try {
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
DocumentBuilder parser =
factory.newDocumentBuilder();
Document d = parser.parse(filename);
} catch (Exception e) {
System.err.println("Exception: " + e.getMessage());
}
}
}
Assieme [Hoffmann et al. UIST 07]
©Raphael Hoffmann
Why not use code search engines only?
Query: “parse xml java”
• Irrelevant (an Emacs Lisp file!?!)
• Code is complicated, contains no comments related to query, and is more than 300(!) lines long
• Requires installation of external library, but no link
• Code on pages essentially the same
Assieme [Hoffmann et al. UIST 07]
©Raphael Hoffmann
Why not use code search engines only?
MAPO [Xie&Pei MSR 06]
Steps in Recommenders
• Data collection/extraction
• Data preprocessing
• Data analysis/mining
• Result postprocessing
• Result representation
Fact Extraction
• Whole-program analysis: applicable when the whole code base is available and compilable
• Partial-program analysis: applicable when only partial code samples are available and not compilable
– When a code search engine is used
Analysis of Partial Code Samples
• Not all code samples contain main method or
driver code that can serve as an entry point
– consider all public methods as entry points
• Deal with local method calls by inlining
methods
• Deal with conditionals/loops by traversing
control flow graphs
• Deal with unknown types with heuristics
PARSEWeb [Thummalapenta&Xie ASE 07]
Type Heuristics I
• Inferring fully qualified class names
import javax.jms.QueueSession;
import java.util.*;
public class test {
public QueueSession qsObj;
public Integer intObj;
public Iterator iter;
…
- Fully qualified name of QueueSession is “javax.jms.QueueSession”,
inferred through lookup of import statement
- Fully qualified name of Integer is “java.lang.Integer”, inferred through
loading of a class by appending “java.lang” to the class name
- Cannot infer the fully qualified name of “Iterator” (incorporating domain
knowledge of java.util helps)
PARSEWeb [Thummalapenta&Xie ASE 07]
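The name-inference heuristics on this slide can be sketched roughly as follows. This is a minimal illustration and not PARSEWeb's actual implementation: the class `TypeNameResolver`, the method `resolve`, and the flag `tryOnDemandImports` are hypothetical names, and the on-demand-import step only succeeds for packages that happen to be on the class path (such as java.util), which is one possible reading of "incorporating domain knowledge".

```java
import java.util.Arrays;
import java.util.List;

class TypeNameResolver {

    /** Returns the fully qualified name for simpleName, or null when it cannot be inferred. */
    static String resolve(String simpleName, List<String> imports, boolean tryOnDemandImports) {
        // Heuristic 1: a single-type import that ends with ".<simpleName>".
        for (String imp : imports) {
            if (imp.endsWith("." + simpleName)) {
                return imp;
            }
        }
        // Heuristic 2: java.lang is imported implicitly, so try loading the class from it.
        if (isLoadable("java.lang." + simpleName)) {
            return "java.lang." + simpleName;
        }
        // Assumed extension: try each on-demand import ("pkg.*") against the class path.
        if (tryOnDemandImports) {
            for (String imp : imports) {
                if (imp.endsWith(".*")) {
                    String candidate = imp.substring(0, imp.length() - 1) + simpleName;
                    if (isLoadable(candidate)) {
                        return candidate;
                    }
                }
            }
        }
        return null; // the name cannot be inferred from the partial sample
    }

    private static boolean isLoadable(String className) {
        try {
            // Load without initializing; only existence matters here.
            Class.forName(className, false, TypeNameResolver.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> imports = Arrays.asList("javax.jms.QueueSession", "java.util.*");
        for (String name : new String[] {"QueueSession", "Integer", "Iterator"}) {
            System.out.println(name + " -> " + resolve(name, imports, false));
        }
    }
}
```

With `tryOnDemandImports` set to false, `Iterator` stays unresolved, matching the third case on the slide.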
Type Heuristics II
• Infer the receiver type in expression “X.Y”
– Look up the declaration of X among local variables or member variables. If none is found, “X” is a class name and Y is a static member
• Infer the receiver type in expression “M1().Y”
– Check the return type of the M1() method declaration. If it is not available locally, the receiver type cannot be inferred
PARSEWeb [Thummalapenta&Xie ASE 07]
Type Heuristics III
• Infer the return type of a method invocation in
an assignment statement such as “Queue qObj =
createQueueSession()”
– Look up the type of the variable on the left-hand side. The return type is the same as, or a subclass of, Queue
• Infer the return type of a method invocation in a return statement such as
public QueueSession test()
{
    ...
    return connect.createQueueSession(false,int);
}
– Look up the return type of the enclosing method declaration
PARSEWeb [Thummalapenta&Xie ASE 07]
Type Heuristics IV
• Infer types with multiple method
invocations
Queue qObj = connect.m1();
Stack sObj = connect.m1().m2();
The receiver type of m2() can be inferred
from the lookup of the return type of m1()
PARSEWeb [Thummalapenta&Xie ASE 07]
Sequence Filtering
• Remove common Java library calls
• Remove sequences that contain no query
words: ClassGen and addMethod
(extracted method invocations are shown as comments next to each statement)
public void generateStubMethod(ClassGen c) {
    InstructionList il = new InstructionList();  // InstructionList.<init>()
    MethodGen m = genFromISList(il);             // genFromISList(InstructionList)
    m.setMaxLocals();                            // MethodGen.setMaxLocals()
    m.setMaxStack();                             // MethodGen.setMaxStack()
    c.addMethod(m.getMethod());                  // MethodGen.getMethod(), ClassGen.addMethod(Method)
    System.out.println("…");                     // PrintStream.println(String)
    …
}
MAPO [Xie&Pei MSR 06]
Type Signature Graph
Any path from h to w is a (h,w)-jungloid
[Figure: type signature graph over the types IJavaElement, IResource, IContainer, IClassFile, IFile, ICompilationUnit, CompilationUnit, and ASTNode, with edges labeled getResource(), getParent(), supertype, and AST.parseCompilationUnit()]
Prospector [Mandelin et al. PLDI 05 ]
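Finding a jungloid can be sketched as a path search over such a graph. The example below is a minimal breadth-first search, not Prospector's actual algorithm or ranking. The class `JungloidSearch` and its edge list are hypothetical: the edges are hand-entered from the labels that survive from the figure, so their exact endpoints are an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class JungloidSearch {

    // Each edge is {source type, edge label, target type}; hand-entered for illustration.
    static final String[][] EDGES = {
        {"IJavaElement", "getResource()", "IResource"},
        {"IResource", "getParent()", "IContainer"},
        {"IFile", "supertype", "IResource"},
        {"IClassFile", "AST.parseCompilationUnit()", "CompilationUnit"},
        {"ICompilationUnit", "AST.parseCompilationUnit()", "CompilationUnit"},
        {"CompilationUnit", "supertype", "ASTNode"},
    };

    /** Returns the edge labels of a shortest path from one type to another, or null if none exists. */
    static List<String> shortestPath(String from, String to) {
        Map<String, String[]> reachedBy = new HashMap<>(); // type -> edge used to reach it
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(from);
        seen.add(from);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (current.equals(to)) {
                break;
            }
            for (String[] edge : EDGES) {
                if (edge[0].equals(current) && seen.add(edge[2])) {
                    reachedBy.put(edge[2], edge);
                    queue.add(edge[2]);
                }
            }
        }
        if (!from.equals(to) && !reachedBy.containsKey(to)) {
            return null; // no (from,to)-jungloid in this graph
        }
        // Walk back from the target to the source, collecting edge labels in order.
        LinkedList<String> labels = new LinkedList<>();
        for (String node = to; !node.equals(from); node = reachedBy.get(node)[0]) {
            labels.addFirst(reachedBy.get(node)[1]);
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(shortestPath("ICompilationUnit", "ASTNode"));
    }
}
```

Breadth-first search returns a shortest path first, which lines up with the "shorter length" ranking idea, but a real tool would enumerate and rank several candidate paths.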
Jungloids with Downcasts
IDebugView debugger = ...
Viewer viewer = debugger.getViewer();
IStructuredSelection sel = (IStructuredSelection) viewer.getSelection();
JavaInspectExpression expr = (JavaInspectExpression) sel.getFirstElement();
IDebugView
getViewer()
Viewer
getSelection()
ISelection
downcast
IStructuredSelection
Object
downcast
Prospector [Mandelin et al. PLDI 05 ]
JavaInspectExpression
Steps in Recommenders
• Data collection/extraction
• Data preprocessing
• Data analysis/mining
• Result postprocessing
• Result representation
Data Analysis/Mining
• Some recommenders don’t use specific
mining techniques to “abstract” or
“generalize” common patterns but return
relevant raw code samples
– Prospector, Strathcona, XSnippet, PARSEWeb
• Data mining can be used to uncover hidden
patterns
– Association rules: CodeWeb [Michail ICSE 00]
– Frequent subsequences: MAPO [Xie&Pei MSR 06]
– Frequent partial orders: Apiator [Acharya et al. FSE 07]
Association Rules
KApplication reuse patterns
CodeWeb [Michail ICSE 00]
Frequent SubSeq/Partial Order
Consider APIs a, b, c, d, e, and f
#include <abcdef.h>
void p ( ) { b ( ); c ( ); }
void q ( ) { c ( ); b ( ); }
void r ( ) { e ( ); f ( ); }
void s ( ) { f ( ); e ( ); }
int main ( )
{
    int i, j, k;
    a ( );
    if ( i == 1 )
    {
        f ( ); e ( ); c ( );
        exit ( );
    }
    else
    {
        if ( j == 1 )
            p ( );
        else
            q ( );
        d ( );
        if ( k == 1 )
            r ( );
        else
            s ( );
    }
}
Apiator [Acharya et al. FSE 07]
Frequent SubSeq/Partial Order
(a) Example code: the program shown above, over APIs a, b, c, d, e, and f
(b) Static program traces:
1. afec
2. abcdef
3. acbdef
4. abcdfe
5. acbdfe
(c) Frequent sequential patterns (support 4/5), as mined by MAPO: abde, abdf, acde, acdf
(d) Frequent partial order R, as mined by Apiator [Figure: partial order over a, b, c, d, e, f]
Apiator [Acharya et al. FSE 07], MAPO [Xie&Pei MSR 06]
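The support values for the sequential patterns can be checked with a small sketch. This only counts how many traces contain a given pattern as a subsequence; it is not a frequent-subsequence miner and not MAPO's implementation. The class `SequenceSupport` and its method names are hypothetical, and each API call is represented by a single character as on the slide.

```java
import java.util.Arrays;
import java.util.List;

class SequenceSupport {

    // The five static program traces from the example.
    static final List<String> TRACES =
        Arrays.asList("afec", "abcdef", "acbdef", "abcdfe", "acbdfe");

    /** True if every call in pattern occurs in trace in the same relative order. */
    static boolean isSubsequence(String pattern, String trace) {
        int matched = 0;
        for (int i = 0; i < trace.length() && matched < pattern.length(); i++) {
            if (trace.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
        }
        return matched == pattern.length();
    }

    /** Number of traces that contain pattern as a subsequence. */
    static int support(String pattern, List<String> traces) {
        int count = 0;
        for (String trace : traces) {
            if (isSubsequence(pattern, trace)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (String pattern : new String[] {"abde", "abdf", "acde", "acdf"}) {
            System.out.println(pattern + ": " + support(pattern, TRACES) + "/" + TRACES.size());
        }
    }
}
```

Each of the four patterns is contained in every trace except afec, which gives the support of 4/5 stated on the slide.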
Data Analysis/Mining
• Data collection/extraction
• Data preprocessing
• Data analysis/mining
• Result postprocessing
• Result representation
Result Postprocessing
• When a third-party miner or learner isn’t
used, this step may be considered part of
the data analysis/mining step.
Examples
• Result clustering
• Result ranking
• Result filtering
Clustering and Ranking
• Candidate method sequences produced by the
data analysis/mining step for query “Source
Destination” may be too many
Solutions:
• Cluster similar sequences
– Clustering heuristics are developed
• Rank sequences
– Ranking heuristics are developed
PARSEWeb [Thummalapenta&Xie ASE 07]
Clustering Heuristics
• Method-invocation sequences with the same set of statements can be considered similar, although the statements are in a different order.
e.g., “2 3 4 5” and “2 4 3 5”
• Method-invocation sequences with minor differences, measured by an attribute called cluster precision value, can be considered similar.
e.g., “8 9 6 7” and “8 6 10 7” can be considered similar under cluster precision value one
PARSEWeb [Thummalapenta&Xie ASE 07]
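One way to read the two clustering heuristics is as a set comparison. The sketch below is not PARSEWeb's implementation, and the interpretation of the cluster precision value is an assumption: two sequences count as similar when each has at most that many statements the other lacks. The class `SequenceClustering` and method `similar` are hypothetical names, and statements are represented by their numbers as on the slide.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SequenceClustering {

    /**
     * Assumed similarity test: ignore order, and allow each sequence to contain
     * at most clusterPrecision statements that the other sequence lacks.
     */
    static boolean similar(List<Integer> first, List<Integer> second, int clusterPrecision) {
        Set<Integer> onlyInFirst = new HashSet<>(first);
        onlyInFirst.removeAll(second);
        Set<Integer> onlyInSecond = new HashSet<>(second);
        onlyInSecond.removeAll(first);
        return onlyInFirst.size() <= clusterPrecision
            && onlyInSecond.size() <= clusterPrecision;
    }

    public static void main(String[] args) {
        // Same statements in a different order: similar even with precision zero.
        System.out.println(similar(Arrays.asList(2, 3, 4, 5), Arrays.asList(2, 4, 3, 5), 0));
        // One differing statement on each side: similar under precision one.
        System.out.println(similar(Arrays.asList(8, 9, 6, 7), Arrays.asList(8, 6, 10, 7), 1));
    }
}
```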
Ranking Heuristics
• Heuristic 1: Higher frequency -> Higher rank
• Heuristic 2: Shorter length -> Higher rank
• Heuristic 3: Fewer package boundaries -> Higher rank
Prospector [Mandelin et al. PLDI 05 ], PARSEWeb [Thummalapenta&Xie ASE 07]
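The three heuristics can be sketched as a comparator. The slide does not say how the heuristics are combined or which tool uses which, so applying them in order as tie-breakers is an assumption made for illustration. The class `SequenceRanking`, the nested `Candidate` type, and its fields are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SequenceRanking {

    /** A candidate sequence, reduced to the three measures the heuristics use. */
    static final class Candidate {
        final String label;
        final int frequency;         // how often the sequence was observed
        final int length;            // number of method invocations
        final int packageBoundaries; // number of package crossings in the sequence

        Candidate(String label, int frequency, int length, int packageBoundaries) {
            this.label = label;
            this.frequency = frequency;
            this.length = length;
            this.packageBoundaries = packageBoundaries;
        }
    }

    // Assumed combination: frequency first (descending), then length, then package boundaries.
    static final Comparator<Candidate> RANK_ORDER =
        Comparator.comparingInt((Candidate c) -> c.frequency).reversed()
                  .thenComparingInt(c -> c.length)
                  .thenComparingInt(c -> c.packageBoundaries);

    /** Returns candidate labels from highest to lowest rank. */
    static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(RANK_ORDER);
        List<String> labels = new ArrayList<>();
        for (Candidate c : sorted) {
            labels.add(c.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(rank(Arrays.asList(
            new Candidate("A", 2, 3, 1),
            new Candidate("B", 5, 4, 2),
            new Candidate("C", 5, 2, 0),
            new Candidate("D", 5, 2, 1))));
    }
}
```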
Query Splitting
• Lack of code samples that give candidate method-invocation sequences in the results of code search engines
– Required method-invocation sequences are split among
different source files
• Solution:
– Split the user query into multiple queries
– Compose the results for each split query
PARSEWeb [Thummalapenta&Xie ASE 07]
Query Splitting Example
1. User query: “org.eclipse.jface.viewers.IStructuredSelection -> java.io.ObjectInputStream”
   Results: None
2. Query: “java.io.ObjectInputStream”
   Results: 3. Most used immediate sources are: java.io.InputStream, java.io.ByteArrayInputStream, java.io.FileInputStream
3. Three queries to be fired:
   “org.eclipse.jface.viewers.IStructuredSelection -> java.io.InputStream” Results: 1
   “org.eclipse.jface.viewers.IStructuredSelection -> java.io.ByteArrayInputStream” Results: 5
   “org.eclipse.jface.viewers.IStructuredSelection -> java.io.FileInputStream” Results: None
PARSEWeb [Thummalapenta&Xie ASE 07]
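The splitting step of this example can be sketched as follows. This is not PARSEWeb's implementation: the class `QuerySplitter` is hypothetical, and the two in-memory maps stand in for the code search and mining back end, filled only with the result counts from the example above. The sketch covers splitting only; composing each partial result with a sequence from the intermediate type to the destination is left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class QuerySplitter {

    // Hypothetical stand-in for the back end: "Source->Destination" -> number of results.
    private final Map<String, Integer> hits;
    // Hypothetical: destination type -> its most used immediate source types.
    private final Map<String, List<String>> immediateSources;

    QuerySplitter(Map<String, Integer> hits, Map<String, List<String>> immediateSources) {
        this.hits = hits;
        this.immediateSources = immediateSources;
    }

    /** Answers the query directly if possible; otherwise fires one split query per immediate source. */
    Map<String, Integer> query(String source, String destination) {
        Map<String, Integer> results = new LinkedHashMap<>();
        String direct = source + "->" + destination;
        if (hits.getOrDefault(direct, 0) > 0) {
            results.put(direct, hits.get(direct));
            return results;
        }
        List<String> alternatives =
            immediateSources.getOrDefault(destination, Collections.<String>emptyList());
        for (String alternative : alternatives) {
            String split = source + "->" + alternative;
            int count = hits.getOrDefault(split, 0);
            if (count > 0) {
                results.put(split, count);
            }
        }
        return results;
    }

    /** Builds a splitter holding only the numbers from the example. */
    static QuerySplitter slideExample() {
        String sel = "org.eclipse.jface.viewers.IStructuredSelection";
        Map<String, Integer> hits = new HashMap<>();
        hits.put(sel + "->java.io.InputStream", 1);
        hits.put(sel + "->java.io.ByteArrayInputStream", 5);
        Map<String, List<String>> sources = new HashMap<>();
        sources.put("java.io.ObjectInputStream", Arrays.asList(
            "java.io.InputStream", "java.io.ByteArrayInputStream", "java.io.FileInputStream"));
        return new QuerySplitter(hits, sources);
    }

    public static void main(String[] args) {
        System.out.println(slideExample().query(
            "org.eclipse.jface.viewers.IStructuredSelection", "java.io.ObjectInputStream"));
    }
}
```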
Result Filtering
• Remove sequences that contain no query words: ClassGen and addMethod
• Compress consecutive calls of the same method into one, e.g., abbba -> aba
• Remove duplicate frequent sequences after the compression, e.g., aba, aba -> aba
• Remove a sequence if it is a subsequence of another, e.g., aba, abab -> abab
MAPO [Xie&Pei MSR 06]
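The last three filtering steps can be sketched as below; the query-word filter is left out. This is an illustration rather than MAPO's implementation: the class `ResultFilter` and its method names are hypothetical, and each method call is represented by a single character as in the examples on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ResultFilter {

    /** Collapses runs of the same call into one occurrence. */
    static String compress(String sequence) {
        StringBuilder out = new StringBuilder();
        for (char call : sequence.toCharArray()) {
            if (out.length() == 0 || out.charAt(out.length() - 1) != call) {
                out.append(call);
            }
        }
        return out.toString();
    }

    /** True if every call in shorter occurs in longer in the same relative order. */
    static boolean isSubsequence(String shorter, String longer) {
        int matched = 0;
        for (int i = 0; i < longer.length() && matched < shorter.length(); i++) {
            if (longer.charAt(i) == shorter.charAt(matched)) {
                matched++;
            }
        }
        return matched == shorter.length();
    }

    /** Compresses, removes duplicates, then drops sequences subsumed by another sequence. */
    static List<String> filter(List<String> sequences) {
        Set<String> unique = new LinkedHashSet<>(); // keeps first-seen order
        for (String sequence : sequences) {
            unique.add(compress(sequence));
        }
        List<String> kept = new ArrayList<>();
        for (String candidate : unique) {
            boolean subsumed = false;
            for (String other : unique) {
                if (!candidate.equals(other) && isSubsequence(candidate, other)) {
                    subsumed = true;
                    break;
                }
            }
            if (!subsumed) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("abbba", "aba", "abab")));
    }
}
```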
Data Analysis/Mining
• Data collection/extraction
• Data preprocessing
• Data analysis/mining
• Result postprocessing
• Result representation
Result Representation
• Display results in the tool user interface
– Strathcona
– XSnippet
– PARSEWeb
– MAPO
– CodeBroker
– Assieme
Strathcona
Strathcona [Holmes et al. 05]
XSnippet
XSnippet [Sahavechaphan&Claypool OOPSLA 06 ]
PARSEWeb
PARSEWeb [Thummalapenta&Xie ASE 07]
http://news.google.com/
PARSEWeb
MAPO (new)
MAPO [Xie&Pei MSR 06]
CodeBroker
CodeBroker [Ye&Fischer ICSE 01]
Information delivery that autonomously locates and presents software developers with task-relevant and personalized components. Active repository!!!
Assieme
• A hybrid search engine
• Index code snippets
found on web pages
• Link them to required
libraries and
documentation
Assieme [Hoffmann et al. UIST 07]
Assieme
links to
required libraries
links to
pages with snippets
group pages with
similar snippets
Assieme [Hoffmann et al. UIST 07]
Example Evaluations of
Recommenders
• Prospector
• Strathcona
• PARSEWeb
Prospector Experiment 1
(ranking test)
• hypothesis:
– to find the desired code, the user needs to
examine only top 5 candidate jungloids.
• result:
– desired code in “top 5” 17 out of 20 times (10 out of 20, in “top 1”)
– remaining three fixable
• methodology:
– used 20 real-world coding tasks
– collected from FAQs, newsgroups, our
practice, emails to us
Prospector Experiment 2
(user study)
• hypothesis:
– Prospector-equipped programmers are better at
solving API programming problems than other
programmers
• methodology:
– 6 problems, each user did 3 with Prospector and 3
without
– problems formulated not to reveal the query
– sample problem:
“The new Java channel IO system represents files as
channels. How do I get a channel that represents a
String filename?”
– somewhat sparse data (10 users)
Experiment 2 (user study).
Results.
• Prospector shortens development time
– some problems solved only by Prospector users
– when both groups succeeded, Prospector users
30% faster
• Prospector may help enable reuse
– non-Prospector users sometimes reimplemented
• Prospector may help avoid making mistakes
– mistakes applying code found on internet into
own code
• The authors expect even stronger results
on a more robust infrastructure.
Strathcona: User Study
• 2 developers were assigned 4 tasks on building a plug-in for Eclipse. Neither developer knew how to implement any of the tasks at hand.
Table 2: Results from Evaluation
Task    Subject    Useful Example  Source Viewed  Succeeded at Task
Task 1  Subject 1  1               1              yes
Task 1  Subject 2  1               1              yes
Task 2  Subject 1  1               2              yes
Task 2  Subject 2  1               6              yes
Task 3  Subject 1  0               2              yes
Task 3  Subject 2  0               6              yes
Task 4  Subject 1  1               2              yes
Task 4  Subject 2  0               7              partially
• The results showed that the tool can deliver relevant and useful
examples to developers. They also showed a developer can
determine when the examples returned are not relevant.
Strathcona [Holmes et al. 05]
Strathcona: Performance and Scalability
• As a test case for scalability, the repository was populated with the Eclipse 3.0 source. The resulting amount of information in the repository is shown in Table 1.
Table 1: Number of Structural Relations
Classes                17,456
Methods                124,359
Fields                 48,441
Inheritance Relations  15,187
Object Instantiations  43,923
Calls Relations        1,066,838
Total                  1,316,204
• With the server on a Pentium 3 800 MHz with 1024 MB RAM and the repository (PostgreSQL DB) on a Pentium 3 1000 MHz with 256 MB RAM, the performance numbers are:
– Less than 500 ms for building a structural context.
– Less than 300 ms for displaying the example.
– 4 – 12 seconds server response time.
Strathcona [Holmes et al. 05]
PARSEWeb Evaluations
• Real Programming Problems: To address problems posted
in developer forums
• Real Projects: To show that solutions recommended by
PARSEWeb are
– available in real projects
– better, on average, than solutions recommended by the related tools Prospector, Strathcona, and Google Code Search
Real Programming Problems
Jakarta BCEL user forum, 2001
Problem: “How to disassemble java byte code”
Query: “Code -> Instruction”
Solution Sequence:
FileName:2_RepMIStubGenerator.java MethodName: isWriteMethod Rank:1
NumberOfOccurrences:1
Code,getCode() ReturnType:#UNKNOWN#
CONSTRUCTOR,InstructionList(#UNKNOWN#) ReturnType:InstructionList
InstructionList,getInstructions() ReturnType:Instruction
Solution Sample Code:
Code code;
InstructionList il = new InstructionList(code.getCode());
Instruction[] ins = il.getInstructions();
Real Programming Problems
Dev 2 Dev Newsgroups, 2006
Problem: “how to connect db by sessionBean”
Query: javax.naming.InitialContext -> java.sql.Connection
Solution Sequence:
FileName:3 AddressBean.java MethodName:getNextUniqueKey Rank:1
NumberOfOccurrences:34
javax.naming.InitialContext,lookup(java.lang.String)
ReturnType:javax.sql.DataSource
javax.sql.DataSource,getConnection()
ReturnType:java.sql.Connection
Real Project: Logic
• Source File: LogicEditor.java
Summary: PARSEWeb: 8/10, Prospector: 6/10, Strathcona: 5/10
Comparison with Prospector
• 12 specific programming tasks taken from the XSnippet approach.
Summary: PARSEWeb: 11/12, Prospector: 7/12
Comparison with Other Tools
Percentage of tasks successfully completed by PARSEWeb,
Prospector, and XSnippet
Significance of Internal Techniques
*Legend:
Method inline: Method inlining
Post Process: Sequence Post Processor
Query Split: Query Splitter
Questions?
Bibliography on Mining Software Engineering Data
http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources
Available Data Mining Tools
http://ase.csc.ncsu.edu/dmse/resources.html
T. Xie Mining Program Source Code
Mining Partial Orders
Consider APIs a, b, c, d, e, and f
Partial Order
Partial Order with
Transitive Reduction
Closed Partial Order
The extracted scenarios are fed to a partial order miner
The partial order miner mines frequent closed partial order
Apiator [Acharya et al. FSE 07]
Example Partial Order
XOpenDisplay
XCreateWindow
XCreateGC
XGetWindowAttributes
XSelectInput
XMapWindow
XSetForeground
XGetBackground
XNextEvent
XGetAtomName
XFreeGC
XCloseDisplay
XChangeWindowAttributes
XMapWindow
A usage scenario around
XOpenDisplay API as a
partial order.
Specifications are shown
with dotted lines.
Apiator [Acharya et al. FSE 07]