Improving Programmer Productivity via
Mining Program Source Code
Tao Xie
Department of Computer Science
North Carolina State University
http://ase.csc.ncsu.edu/dmse/
Mining SE Data
• MAIN GOAL
– Transform static record-keeping SE data to active data
– Make SE data actionable by uncovering hidden patterns and trends
[Figure: sources of SE data: Bugzilla, mailings, code repository, CVS, execution traces]
T. Xie Mining Program Source Code
Overview of Mining SE Data
• Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
• Data mining techniques: association/patterns, classification, clustering, …
• Software engineering data: code bases, change history, program states, structural entities, bug reports/nl, …
Overview of Mining SE Data
[Figure: publications from 1999 to 2007 at ASE, ICSE, FSE, PLDI, POPL, OSDI, SOSP, OOPSLA, ISSTA, and KDD, grouped by the kind of software engineering data mined: code bases, change history, program states, structural entities, bug reports/nl, …]
Overview of Mining SE Data
[Figure: publications from 1999 to 2007 at ASE, ICSE, FSE, PLDI, POPL, OSDI, SOSP, OOPSLA, ISSTA, and KDD, grouped by the software engineering task helped by data mining: programming, defect detection, testing, debugging, maintenance, …]
Sample Projects on
Mining Program Source Code
Data | Algorithms | Tasks
Set of functions, variables, etc. in a C function | Frequent itemset | Programming-rules-related bug finding, UIUC [FSE 05]
Statement seq in a basic block in C | Frequent subsequence | Copy-paste bug finding, UIUC [OSDI 04]
Methods seq in a Java method from code search engine | Frequent subsequence | API usage patterns, NCSU [MSR 06]
Function seq in whole C program | Frequent partial order | API usage patterns/properties, NCSU [FSE 07]
System dependence graph in whole C program | Frequent subgraph | Neglected-condition bug finding, CASE [ISSTA 07]
Java API method signatures | Plan generation | API Jungloids, Berkeley [PLDI 05]
Method seq in a Java method from code search engine | Frequent sequences | API Jungloids, NCSU [ASE 07]
Some Recent Trends
• Data: dynamic execution data -> + static code bases
• Task: productivity (programming) -> + quality (defect detection, testing, debugging)
• Mining algorithm: simple ones (association rule) -> + frequent itemset/subsequence/partial order/subgraph
• Data scope: local repositories -> public repositories with code search engines
Mining API Usage Patterns
• How should an API be used correctly?
– An API may serve multiple functionalities
– Different styles of API usage
• MAPO: “I know what method call I need, but I
don’t know how to write code before and after this
method call” [Xie&Pei MSR 06]
Example Task -- MAPO
• “instrument the bytecode of a Java class by
adding an extra method to the class”
– org.apache.bcel.generic.ClassGen
public void addMethod(Method m)
First Try: ClassGen Java API Doc
addMethod
public void addMethod(Method m)
Add a method to this class.
Parameters:
m - method to add
Second Try: Code Search Engine
MAPO Approach
• Analyze code segments relevant to a given API
and disclose the inherent usage patterns
– Input: an API characterized by a method, class, or
package
– Code search engine: used to search relevant source
files from open source repositories
– Frequent sequence miner: use BIDE [Wang&Han 04] to
mine closed sequential patterns from extracted method-call sequences
– Output: a short list of frequent API usage patterns
related to the API
Sequence Extraction
• Method sequences: extracted from Java
source files returned from code search
engines
Source code:
  public void generateStubMethod(ClassGen c) {
    InstructionList il = new InstructionList();
    MethodGen m = genFromISList(il);
    m.setMaxLocals();
    m.setMaxStack();
    c.addMethod(m.getMethod());
    System.out.println("…");
    …
  }
Call sequence:
  InstructionList.<init>()
  genFromISList(InstructionList)
  MethodGen.setMaxLocals()
  MethodGen.setMaxStack()
  MethodGen.getMethod()
  ClassGen.addMethod(Method)
  PrintStream.println(String)
  …
Sequence Preprocessing
• Remove common Java library calls
• Inline callees of the same class
• Remove sequences that contain no query
words: ClassGen and addMethod
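The three preprocessing steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not MAPO's implementation: the `LIBRARY_CLASSES` set, the `local_bodies` map, and the body given for `genFromISList` are hypothetical, and a sequence is kept when any of its calls mentions a query word.

```python
# Hypothetical set of common Java library classes whose calls are dropped.
LIBRARY_CLASSES = {"PrintStream", "String", "StringBuffer", "Object"}


def preprocess(sequence, local_bodies, query_words):
    """Return the cleaned call sequence, or None if it should be discarded."""
    # Inline callees of the same class (one level deep in this sketch).
    inlined = []
    for call in sequence:
        inlined.extend(local_bodies.get(call, [call]))
    # Remove common Java library calls, identified by the receiver class name.
    kept = [c for c in inlined if c.split(".")[0] not in LIBRARY_CLASSES]
    # Discard the sequence if none of its calls mentions a query word.
    if not any(word in call for call in kept for word in query_words):
        return None
    return kept


raw = [
    "InstructionList.<init>()",
    "genFromISList(InstructionList)",
    "MethodGen.setMaxLocals()",
    "MethodGen.setMaxStack()",
    "MethodGen.getMethod()",
    "ClassGen.addMethod(Method)",
    "PrintStream.println(String)",
]
# Assumed body of the local helper; the slides do not show it.
bodies = {"genFromISList(InstructionList)": ["MethodGen.<init>(InstructionList)"]}
cleaned = preprocess(raw, bodies, ["ClassGen", "addMethod"])
```

Here the local helper call is replaced by its assumed body, the `PrintStream.println(String)` call is dropped, and the sequence survives because it contains `ClassGen.addMethod(Method)`.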
Frequent Seq Postprocessing
• Remove sequences that contain no query
words: ClassGen and addMethod
• Compress consecutive calls of the same method into one, e.g., abbba -> aba
• Remove duplicate frequent sequences after the compression, e.g., aba, aba -> aba
• Reduce a seq if it is a subseq of another, e.g., aba, abab -> abab
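The postprocessing rules above can be sketched in Python. This is an illustrative sketch rather than MAPO's code: sequences are modeled as tuples of call names, the function names are made up for the example, and a sequence counts as containing a query word when any call name includes it as a substring.

```python
from itertools import groupby


def compress(seq):
    """Collapse consecutive repeats of the same call, e.g. abbba -> aba."""
    return tuple(call for call, _ in groupby(seq))


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting elements."""
    remaining = iter(long)
    return all(call in remaining for call in short)


def postprocess(sequences, query_words):
    # Keep only sequences in which some call mentions a query word.
    kept = [s for s in sequences
            if any(w in call for call in s for w in query_words)]
    # Compress consecutive repeats, then drop duplicates (order preserved).
    unique = list(dict.fromkeys(compress(s) for s in kept))
    # Drop every sequence that is a subsequence of a different one.
    return [s for s in unique
            if not any(s != t and is_subsequence(s, t) for t in unique)]


patterns = postprocess(
    [tuple("abbba"), tuple("aba"), tuple("abab"), tuple("bcb")],
    query_words=["a"],
)
```

With single letters standing in for calls and "a" as the only query word, "bcb" is filtered out, "abbba" compresses to "aba", the duplicate "aba" is dropped, and "aba" is then absorbed by "abab".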
Tool Architecture
[Figure: tool architecture; the code search engine is, e.g., koders.com]
Sample Mined API Sequence
InstructionList.<init>()
InstructionFactory.createLoad(Type, int)
InstructionList.append(Instruction)
InstructionFactory.createReturn(Type)
InstructionList.append(Instruction)
MethodGen.setMaxStack()
MethodGen.setMaxLocals()
MethodGen.getMethod()
ClassGen.addMethod(Method)
InstructionList.dispose()
Mining API Usage Patterns
• MAPO: “I know what method call I need, but I
don’t know how to write code before and after this
method call” [Xie&Pei MSR 06]
• Apiartor: “I know what possible set of APIs I need,
but I don’t know which ones need to be used and in what
order to use them” [Acharya et al. FSE 07]
Usage Patterns as Partial Order
abde
abdf
acde
acdf
#include <abcdef.h>
void p ( ) { b ( ); c ( ); }
void q ( ) { c ( ); b ( ); }
void r ( ) { e ( ); f ( ); }
void s ( ) { f ( ); e ( ); }
int main ( ) {
int i, j, k;
a ( );
if ( i == 1) {
f ( ); e ( ); c ( );
exit ( );
} else {
if ( j == 1 ) p ( );
else q ( );
d ( );
if ( k == 1 ) r ( );
else s ( );
}
} (a) Example code
(c) Frequent subseq patterns
1
2
3
4
5
afec
abcdef
acbdef
abcdfe
acbdfe
(b) Static program traces
T. Xie Mining Program Source Code
a
b
c
d
e
f
(d) Frequent partial order R
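To make the step from traces (b) to a partial order (d) concrete, here is a simplified Python sketch. It is a stand-in, not the frequent closed partial order mining used in the cited work: it counts how many traces order each pair of calls the same way, keeps pairs that meet a support threshold, and removes edges implied by transitivity. It assumes each call appears at most once per trace and that the frequent pairs form an acyclic, transitively closed relation; the function name and the threshold of 4 are choices made for this example.

```python
from itertools import combinations


def frequent_partial_order(traces, min_support):
    """Return edges (x, y), meaning "x before y", of a frequent partial order."""
    counts = {}
    for trace in traces:
        # Count every ordered pair (earlier call, later call) once per trace.
        for x, y in combinations(trace, 2):
            counts[(x, y)] = counts.get((x, y), 0) + 1
    order = {pair for pair, n in counts.items() if n >= min_support}
    calls = {c for pair in order for c in pair}
    # Transitive reduction: drop x -> z when some y gives x -> y and y -> z.
    return {(x, z) for (x, z) in order
            if not any((x, y) in order and (y, z) in order for y in calls)}


traces = ["afec", "abcdef", "acbdef", "abcdfe", "acbdfe"]
edges = frequent_partial_order(traces, min_support=4)
```

On the five traces from the slide, the remaining edges say that a precedes b and c, both b and c precede d, and d precedes e and f, while b/c and e/f stay unordered.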
Apiartor Overview
[Figure: Apiartor architecture. Components: Trigger Generator, Model Checker, Trace Generator, Scenario Extractor, Miner, Specification Extractor. Data: Source Code, User-specified APIs, Related APIs, Triggers, Traces, Independent Scenarios, Frequent Usage Scenarios, Partial Orders, Specifications]
Example Partial Orders
[Figure: partial order over the X11 calls XOpenDisplay, XCreateWindow, XCreateGC, XGetWindowAttributes, XSelectInput, XMapWindow, XSetForeground, XGetBackground, XNextEvent, XGetAtomName, XFreeGC, XChangeWindowAttributes, XCloseDisplay]
A usage scenario around the XOpenDisplay API as a partial order. Specifications are shown with dotted lines.
Mining API Usage Patterns
• MAPO: “I know what method call I need, but I
don’t know how to write code before and after this
method call” [Xie&Pei MSR 06]
• Apiartor: “I know what possible set of APIs I need,
but I don’t know which ones need to be used and in what
order to use them” [Acharya et al. FSE 07]
• PARSEWeb: “I know what type of object I need,
but I don’t know how to write the code to get the
object” [Thummalapenta&Xie ASE 07]
Example Task - OpenJMS
Sun Java Message Services API Spec
• Query:
“javax.jms.QueueConnectionFactory ->
javax.jms.QueueSender”
• PARSEWeb Solution:
FileName:0_UserBean.java MethodName:ingest Rank:1 NumberOfOccurrences:23 Confidence:True Path: 1 2 3
javax.jms.QueueConnectionFactory,createQueueConnection() ReturnType:javax.jms.QueueConnection
javax.jms.QueueConnection,createQueueSession(boolean,javax.jms.Session.AUTO_ACKNOWLEDGE) ReturnType:javax.jms.QueueSession
javax.jms.QueueSession,createSender(javax.jms.Queue) ReturnType:javax.jms.QueueSender
PARSEWeb Overview
[Figure: PARSEWeb architecture. Components: Code Search Engine, Code Downloader, Code Analyzer, Sequence Miner, Query Splitter. Data: Query, Open Source Repositories, Local Source Code Repository, Method Invocation Sequences, Clustered Method Invocation Sequences, Final Method Invocation Sequences]
Code Analyzer
• Collect [Source -> Destination] method
sequences invoked by each public method
– Deal with local method calls by inlining methods
– Deal with conditionals/loops by traversing
control flow graphs
• Resolve types in sequences
– Challenges: downloaded files are partial
– Solutions: heuristics are developed
Type Heuristics
• Heuristic 1: The return type of a method-invocation statement contained in an initialization expression is the same as the type of the declared variable.
e.g., QueueConnection connect;
QueueSession session = connect.createQueueSession(false,int)
• Heuristic 2: The return type of an outermost method invocation contained in a return statement is the same as the return type of the enclosing method declaration.
e.g., public QueueSession test()
{
...
return connect.createQueueSession(false,int);
}
Sequence Miner
• The code analyzer may produce too many
candidate sequences
Solutions:
• Cluster similar sequences
– Clustering heuristics are developed
• Rank sequences
– Ranking heuristics are developed
Clustering Heuristics
• Heuristic 1: Method-invocation sequences with the same set of statements can be considered similar, even if the statements are in a different order.
e.g., “2 3 4 5” and “2 4 3 5”
• Heuristic 2: Method-invocation sequences that differ by no more than a given cluster precision value can be considered similar.
e.g., “8 9 6 7” and “8 6 10 7” can be considered similar under a cluster precision value of one.
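One possible reading of the two clustering heuristics, sketched in Python. The slides do not define the distance precisely, so the measure here is an assumption: two sequences are treated as similar when each contains at most `precision` statements that the other lacks (a precision of 0 reproduces Heuristic 1). The greedy grouping and the function names are also illustrative choices, not PARSEWeb's implementation.

```python
def similar(seq_a, seq_b, precision=0):
    """True if each sequence has at most `precision` statements the other lacks."""
    a, b = set(seq_a), set(seq_b)
    return max(len(a - b), len(b - a)) <= precision


def cluster(sequences, precision=0):
    """Greedily add each sequence to the first cluster with a similar representative."""
    clusters = []
    for seq in sequences:
        for group in clusters:
            if similar(group[0], seq, precision):
                group.append(seq)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([seq])
    return clusters


seqs = [(2, 3, 4, 5), (2, 4, 3, 5), (8, 9, 6, 7), (8, 6, 10, 7)]
exact = cluster(seqs, precision=0)
loose = cluster(seqs, precision=1)
```

With precision 0, only the two reordered sequences are merged; with precision 1, the sequences that differ in one statement (9 versus 10) are merged as well.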
Ranking Heuristics
• Heuristic 1: Higher frequency -> Higher rank
• Heuristic 2: Shorter length -> Higher rank
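A minimal Python sketch of the ranking heuristics. How the two heuristics are combined is not stated on the slide; this sketch assumes frequency is the primary key and length breaks ties. The candidate sequences and counts below are made-up illustration data.

```python
def rank(candidates):
    """Sort (sequence, frequency) pairs: higher frequency first, then shorter length."""
    return sorted(candidates, key=lambda item: (-item[1], len(item[0])))


# Hypothetical candidates: (method-invocation sequence, number of occurrences).
candidates = [
    (("A.x()", "B.y()", "C.z()"), 23),
    (("A.x()", "C.z()"), 23),
    (("A.w()",), 5),
]
ranked = rank(candidates)
```

The two sequences with 23 occurrences come first, the shorter of them on top; the single-call sequence ranks last despite its length because its frequency is lower.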
Query Splitter
• Lack of code samples in the results of code
search engines
– Code samples are split among different files
Solution:
• Split the user query into multiple queries
• Compose the results for each split query
Query Splitting Example
1. User query: “org.eclipse.jface.viewers.IStructuredSelection->java.io.ObjectInputStream”
Results: None
2. Query: “java.io.ObjectInputStream”
Results: 3.
Most used sources are: java.io.InputStream, java.io.ByteArrayInputStream,
java.io.FileInputStream
3. Three Queries to be fired:
“org.eclipse.jface.viewers.IStructuredSelection-> java.io.InputStream”
Results: 1
“org.eclipse.jface.viewers.IStructuredSelection-> java.io.ByteArrayInputStream”
Results: 5
“org.eclipse.jface.viewers.IStructuredSelection-> java.io.FileInputStream”
Results: None
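The splitting step in this example can be sketched in Python. Everything below is a toy under stated assumptions: `search` and `most_used_sources` are hypothetical callables standing in for the code search engine and the analysis of the destination-only query, class names are shortened, and the in-memory data is invented. The final step of composing each partial result with a sequence from the alternative type to the original destination is left out.

```python
def split_and_search(source, destination, search, most_used_sources):
    """Return {(source, target): results}, splitting the query if it has no results."""
    results = search(source, destination)
    if results:
        return {(source, destination): results}
    # No direct hit: retry with the most used sources of the destination type.
    return {(source, alt): search(source, alt)
            for alt in most_used_sources(destination)}


# Toy in-memory stand-in for the code search engine (hypothetical data).
INDEX = {
    ("IStructuredSelection", "InputStream"): ["seq-1"],
    ("IStructuredSelection", "ByteArrayInputStream"): ["seq-2", "seq-3"],
}
ALTERNATIVES = {
    "ObjectInputStream": ["InputStream", "ByteArrayInputStream", "FileInputStream"],
}

found = split_and_search(
    "IStructuredSelection",
    "ObjectInputStream",
    search=lambda s, d: INDEX.get((s, d), []),
    most_used_sources=lambda d: ALTERNATIVES.get(d, []),
)
```

The original query has no results, so three split queries are fired, one per alternative source type, and each keeps its own (possibly empty) result list.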
Eclipse Plugin
Evaluations
• Real Programming Problems: To address problems posted
in developer forums.
• Real Projects: To show that solutions recommended by
PARSEWeb are
– available in real projects
– better on average than solutions recommended by the related tools PROSPECTOR, Strathcona, and Google Code Search Engine
Jakarta BCEL User Forum
• Jakarta BCEL user forum, 2001
Problem: “How to disassemble java byte code”
Query: “Code -> Instruction”
Solution Sample Code:
Code code;
InstructionList il = new InstructionList(code.getCode());
Instruction[] ins = il.getInstructions();
Dev2Dev Newsgroups
• Dev 2 Dev Newsgroups, 2006
Problem: “how to connect db by sesseionBean”
Query: javax.naming.InitialContext -> java.sql.Connection
Solution Sequence:
FileName:3_AddressBean.java MethodName:getNextUniqueKey Rank:1 NumberOfOccurrences:34
javax.naming.InitialContext,lookup(java.lang.String) ReturnType:javax.sql.DataSource
javax.sql.DataSource,getConnection() ReturnType:java.sql.Connection
Challenges in Mining Code
• Sometimes too few data samples
– Scalability is usually not an issue
– Static code bases vs. change histories
• Data preparation/preprocessing
– Related to traditional program analysis
• Pattern postprocessing (filtering and ranking)
– Heuristics play important roles
• Demand-driven mining vs. any gold mining
– Programming vs. bug finding
Conclusion
• Mining various types of software engineering data to aid software engineering tasks
• Mining program source code to improve programmer productivity
– MAPO: mining API usage patterns for a given API
– Apiartor: mining API usage patterns for a given set of APIs
– PARSEWeb: mining API usage patterns for input-output-type queries
Questions?
Mining Software Engineering Data Bibliography
http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources
Demand-Driven Or Not
• Any-gold mining
– Examples: DynaMine, …
– Advantages: Surface only the cases that are applicable
– Issues: How much gold is good enough given the amount of data to be mined?
• Demand-driven mining
– Examples: MAPO, BugTriage, …
– Advantages: Exploit demands to filter out irrelevant information
– Issues: What percentage of cases would work well?
Code vs. Non-Code
• Code/Programming Langs
– Examples: MAPO, DynaMine, …
– Advantages: Relatively stable and consistent representation
• Non-Code/Natural Langs
– Examples: BugTriage, CVS/Code comments, emails, docs
– Advantages: Common source of capturing programmers’ intentions
• Issues: What project/context-specific heuristics to use?
Static vs. Dynamic
• Static Data: code bases, change histories
– Examples: MAPO, DynaMine, …
– Advantages: No need to set up exec environment; more scalable
– Issues: How to reduce false positives?
• Dynamic Data: prog states, structural profiles
– Examples: Spec discovery, …
– Advantages: More-precise info
– Issues: How to reduce false negatives? Where do tests come from?
Snapshot vs. Changes
• Code snapshot
– Examples: MAPO, …
– Advantages: Larger amount of available data
• Code change history
– Examples: DynaMine, …
– Advantages: Revision transactions encode more-focused entity relationships
– Issues: How to group CVS changes into transactions?