Improving Programmer Productivity via Mining Program Source Code
Tao Xie
Department of Computer Science
North Carolina State University
http://ase.csc.ncsu.edu/dmse/

Mining SE Data
• MAIN GOAL
– Transform static recordkeeping SE data to active data
– Make SE data actionable by uncovering hidden patterns and trends
[Figure: example SE data sources: Bugzilla, mailings, code repository, CVS, execution traces]

T. Xie Mining Program Source Code

Overview of Mining SE Data
• Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
• Data mining techniques: association/patterns, classification, clustering, …
• Software engineering data: code bases, change history, program states, structural entities, bug reports/nl, …

[Figure: the overview annotated, per kind of software engineering data, with the years and venues of related publications (1999 to 2007; ASE, ICSE, FSE, PLDI, POPL, OSDI, OOPSLA, KDD, ISSTA, SOSP)]

[Figure: the overview annotated, per software engineering task, with the years and venues of related publications (1999 to 2007; same venues)]
Sample Projects on Mining Program Source Code
• Data: set of functions, variables, etc. in a C function; Algorithm: frequent itemset; Task: programming-rules-related bug finding, UIUC [FSE 05]
• Data: statement seq in a basic block in C; Algorithm: frequent subsequence; Task: copy-paste bug finding, UIUC [OSDI 04]
• Data: method seq in a Java method from code search engine; Algorithm: frequent subsequence; Task: API usage patterns, NCSU [MSR 06]
• Data: function seq in whole C program; Algorithm: frequent partial order; Task: API usage patterns/properties, NCSU [FSE 07]
• Data: system dependence graph in whole C program; Algorithm: frequent subgraph; Task: neglected-condition bug finding, CASE [ISSTA 07]
• Data: Java API method signatures; Algorithm: plan generation; Task: API Jungloids, Berkeley [PLDI 05]
• Data: method seq in a Java method from code search engine; Algorithm: frequent sequences; Task: API Jungloids, NCSU [ASE 07]

Some Recent Trends
• Data: dynamic execution data + static code bases
• Task: productivity (programming) + quality (defect detection, testing, debugging)
• Mining algorithm: simple ones (association rule) + frequent itemset/subsequence/partial order/subgraph
• Data scope: local repositories → public repositories with code search engines
Mining API Usage Patterns
• How should an API be used correctly?
– An API may serve multiple functionalities
– Different styles of API usage
• MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]

Example Task -- MAPO
• “instrument the bytecode of a Java class by adding an extra method to the class”
– org.apache.bcel.generic.ClassGen
  public void addMethod(Method m)

First Try: ClassGen Java API Doc
addMethod
  public void addMethod(Method m)
  Add a method to this class.
  Parameters: m - method to add

Second Try: Code Search Engine
[Figure: code search engine results]

MAPO Approach
• Analyze code segments relevant to a given API and disclose the inherent usage patterns
– Input: an API characterized by a method, class, or package
– Code search engine: used to search relevant source files from open source repositories
– Frequent sequence miner: use BIDE [Wang&Han 04] to mine closed sequential patterns from extracted method-call sequences
– Output: a short list of frequent API usage patterns related to the API
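The notion of a "frequent" call sequence in the MAPO approach can be illustrated with a small sketch. This is not BIDE and not MAPO's implementation: it only counts, for hand-picked candidate patterns, how many extracted call sequences contain the pattern as a subsequence. The function names `is_subsequence`, `support`, and `frequent` and the sample data are hypothetical.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], calls: Sequence[str]) -> bool:
    """Return True if the pattern's calls occur in `calls` in order (gaps allowed)."""
    remaining = iter(calls)
    # `item in remaining` advances the iterator until it finds `item`,
    # so later pattern items are only searched for after earlier ones.
    return all(item in remaining for item in pattern)


def support(pattern: Sequence[str], sequences: List[Sequence[str]]) -> int:
    """Number of call sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in sequences)


def frequent(candidates, sequences, min_support):
    """Keep the candidate patterns whose support reaches min_support."""
    return [p for p in candidates if support(p, sequences) >= min_support]


# Hypothetical call sequences, as if extracted from three methods.
sequences = [
    ["InstructionList.<init>", "MethodGen.setMaxStack",
     "MethodGen.setMaxLocals", "MethodGen.getMethod", "ClassGen.addMethod"],
    ["MethodGen.getMethod", "ClassGen.addMethod"],
    ["ClassGen.addMethod"],
]
candidates = [
    ["MethodGen.getMethod", "ClassGen.addMethod"],
    ["MethodGen.setMaxStack", "ClassGen.addMethod"],
]
print(frequent(candidates, sequences, min_support=2))
```

A real closed-sequential-pattern miner such as BIDE generates the candidates itself and prunes patterns subsumed by longer ones with the same support; the sketch only shows the support test.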
Sequence Extraction
• Method sequences: extracted from Java source files returned from code search engines

Source code:
  public void generateStubMethod(ClassGen c) {
    InstructionList il = new InstructionList();
    MethodGen m = genFromISList(il);
    m.setMaxLocals();
    m.setMaxStack();
    c.addMethod(m.getMethod());
    System.out.println(“…”);
    …
  }

Call sequence:
  InstructionList.<init>()
  genFromISList(InstructionList)
  MethodGen.setMaxStack()
  MethodGen.setMaxLocals()
  MethodGen.getMethod()
  ClassGen.addMethod(Method)
  PrintStream.println(String)
  …

Sequence Preprocessing
• Remove common Java library calls
• Inline callees of the same class
• Remove sequences that contain no query words: ClassGen and addMethod

Frequent Seq Postprocessing
• Remove sequences that contain no query words: ClassGen and addMethod
• Compress consecutive calls of the same method into one, e.g., abbba → aba
• Remove duplicate frequent sequences after the compression, e.g., aba, aba → aba
• Reduce a seq if it is a subseq of another, e.g., aba, abab → abab

Tool Architecture
[Figure: tool architecture; the code search engine is, e.g., koders.com]

Sample Mined API Sequence
  InstructionList.<init>()
  InstructionFactory.createLoad(Type, int)
  InstructionList.append(Instruction)
  InstructionFactory.createReturn(Type)
  InstructionList.append(Instruction)
  MethodGen.setMaxStack()
  MethodGen.setMaxLocals()
  MethodGen.getMethod()
  ClassGen.addMethod(Method)
  InstructionList.dispose()
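The steps listed under Frequent Seq Postprocessing can be sketched as follows. This is an illustration under stated assumptions, not MAPO's code: the helper names are hypothetical, and a sequence is assumed to "contain a query word" if any of its calls contains any query word as a substring.

```python
from typing import List, Sequence


def compress_runs(seq: Sequence[str]) -> List[str]:
    """Collapse consecutive repeats of the same call, e.g. abbba -> aba."""
    out: List[str] = []
    for call in seq:
        if not out or out[-1] != call:
            out.append(call)
    return out


def is_subsequence(short: Sequence[str], long: Sequence[str]) -> bool:
    """True if `short` occurs within `long` in order (gaps allowed)."""
    remaining = iter(long)
    return all(call in remaining for call in short)


def postprocess(patterns, query_words):
    # 1. Drop patterns that mention no query word (assumption: any word suffices).
    kept = [p for p in patterns
            if any(word in call for call in p for word in query_words)]
    # 2. Compress consecutive calls of the same method.
    compressed = [tuple(compress_runs(p)) for p in kept]
    # 3. Remove duplicates while preserving the original order.
    unique = list(dict.fromkeys(compressed))
    # 4. Drop a pattern if it is a subsequence of a different remaining pattern.
    return [list(p) for p in unique
            if not any(p != q and is_subsequence(p, q) for q in unique)]


# The letter examples from the slide, with "a" standing in for a query word.
patterns = [list("abbba"), list("aba"), list("abab"), list("cd")]
print(postprocess(patterns, query_words=["a"]))
```

On the slide's letter examples, abbba and aba both collapse to aba, the duplicate is removed, and aba is then dropped in favor of abab.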
Mining API Usage Patterns
• Apiartor: “I know what possible set of APIs I need, but I don’t know what need to be used and what orders to use” [Acharya et al. FSE 07]

Usage Patterns as Partial Order
(a) Example code
  #include <abcdef.h>
  void p ( ) { b ( ); c ( ); }
  void q ( ) { c ( ); b ( ); }
  void r ( ) { e ( ); f ( ); }
  void s ( ) { f ( ); e ( ); }
  int main ( ) {
    int i, j, k;
    a ( );
    if ( i == 1) {
      f ( ); e ( ); c ( ); exit ( );
    }
    else {
      if ( j == 1 ) p ( );
      else q ( );
      d ( );
      if ( k == 1 ) r ( );
      else s ( );
    }
  }
(b) Static program traces
  1 afec
  2 abcdef
  3 acbdef
  4 abcdfe
  5 acbdfe
(c) Frequent subseq patterns: abde, abdf, acde, acdf
(d) Frequent partial order R
  [Figure: partial-order graph over a, b, c, d, e, f]

Apiartor Overview
[Figure: Apiartor architecture. Labels shown: User-specified APIs, Related APIs, Scenario Extractor, Trace Generator, Trigger Generator, Triggers, Miner, Partial Orders, Model Checker, Source Code, Independent Scenarios, Traces, Specification Extractor, Frequent Usage Scenarios, Specifications]
Example Partial Orders
[Figure: a usage scenario around the XOpenDisplay API as a partial order; specifications are shown with dotted lines. Calls shown: XOpenDisplay, XCreateWindow, XCreateGC, XGetWindowAttributes, XSelectInput, XMapWindow, XSetForeground, XGetBackground, XNextEvent, XGetAtomName, XFreeGC, XChangeWindowAttributes, XMapWindow, XCloseDisplay]

Mining API Usage Patterns
• PARSEWeb: “I know what type of object I need, but I don’t know how to write the code to get the object” [Thummalapenta&Xie ASE 07]
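The step from program traces to a partial order, as in the "Usage Patterns as Partial Order" slide, can be illustrated with a small sketch. It is not Apiartor's algorithm: it assumes the frequent traces have already been selected (here the four traces that contain all six calls, leaving out afec) and that each call occurs exactly once per trace. The function name `partial_order` is hypothetical.

```python
from itertools import permutations
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def partial_order(traces: List[str]) -> Tuple[Set[Edge], Set[Edge]]:
    """Return (all ordered pairs, their transitive reduction).

    A pair (x, y) is in the order if x precedes y in every trace.
    """
    items = set(traces[0])
    order = {(x, y) for x, y in permutations(items, 2)
             if all(t.index(x) < t.index(y) for t in traces)}
    # Keep only direct edges: drop (x, y) when some z lies between them.
    reduced = {(x, y) for (x, y) in order
               if not any((x, z) in order and (z, y) in order for z in items)}
    return order, reduced


# Traces 2 to 5 from the slide; each character is one API call.
traces = ["abcdef", "acbdef", "abcdfe", "acbdfe"]
order, reduced = partial_order(traces)
print(sorted(reduced))
```

For these traces the direct edges are a→b, a→c, b→d, c→d, d→e, and d→f, while b and c, and likewise e and f, stay unordered.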
Example Task - OpenJMS
Sun Java Message Services API Spec
• Query: “javax.jms.QueueConnectionFactory -> javax.jms.QueueSender”
• PARSEWeb Solution:
  FileName:0_UserBean.java MethodName:ingest Rank:1 NumberOfOccurrences:23
  Confidence:True
  Path:
  1 javax.jms.QueueConnectionFactory,createQueueConnection() ReturnType:javax.jms.QueueConnection
  2 javax.jms.QueueConnection,createQueueSession(boolean,javax.jms.Session.AUTO_ACKNOWLEDGE) ReturnType:javax.jms.QueueSession
  3 javax.jms.QueueSession,createSender(javax.jms.Queue) ReturnType:javax.jms.QueueSender

PARSEWeb Overview
[Figure: PARSEWeb architecture. Labels shown: Query, Query Splitter, Code Search Engine, Open Source Repositories, Code Downloader, Local Source Code Repository, Code Analyzer, Method Invocation Sequences, Sequence Miner, Clustered Method Invocation Sequences, Final Method Invocation Sequences]

Code Analyzer
• Collect [Source → Destination] method sequences invoked by each public method
– Deal with local method calls by inlining methods
– Deal with conditionals/loops by traversing control flow graphs
• Resolve types in sequences
– Challenges: downloaded files are partial
– Solutions: heuristics are developed

Type Heuristics
• Heuristic 1: The return type of a method-invocation statement contained in an initialization expression is the same as the type of the declared variable.
  e.g., QueueConnection connect;
        QueueSession session = connect.createQueueSession(false,int)
• Heuristic 2: The return type of an outermost method invocation contained in a return statement is the same as the return type of the enclosing method declaration.
  e.g., public QueueSession test() { ... return connect.createQueueSession(false,int); }

Sequence Miner
• The code analyzer may produce too many candidate sequences
Solutions:
• Cluster similar sequences
– Clustering heuristics are developed
• Rank sequences
– Ranking heuristics are developed

Clustering Heuristics
• Heuristic 1: Method-invocation sequences with the same set of statements can be considered similar, although the statements are in a different order.
  e.g., “2 3 4 5” and “2 4 3 5”
• Heuristic 2: Method-invocation sequences differing by at most a given cluster precision value can be considered similar.
  e.g., “8 9 6 7” and “8 6 10 7” can be considered similar under a cluster precision value of one.

Ranking Heuristics
• Heuristic 1: Higher frequency -> Higher rank
• Heuristic 2: Shorter length -> Higher rank

Query Splitter
• Lack of code samples in the results of code search engines
– Code samples are split among different files
Solution:
• Split the user query into multiple queries
• Compose the results for each split query

Query Splitting Example
1. User query: “org.eclipse.jface.viewers.IStructuredSelection->java.io.ObjectInputStream”
   Results: None
2. Query: “java.io.ObjectInputStream”
   Results: 3.
   Most used sources are: java.io.InputStream, java.io.ByteArrayInputStream, java.io.FileInputStream
3. Three queries to be fired:
   “org.eclipse.jface.viewers.IStructuredSelection-> java.io.InputStream” Results: 1
   “org.eclipse.jface.viewers.IStructuredSelection-> java.io.ByteArrayInputStream” Results: 5
   “org.eclipse.jface.viewers.IStructuredSelection-> java.io.FileInputStream” Results: None

Eclipse Plugin
[Figure: screenshot of the Eclipse plugin]

Evaluations
• Real Programming Problems: To address problems posted in developer forums.
• Real Projects: To show that solutions recommended by PARSEWeb are
– available in real projects
– on average better than solutions recommended by the related tools PROSPECTOR, Strathcona, and Google Code Search Engine

Jakarta BCEL User Forum
• Jakarta BCEL user forum, 2001
  Problem: “How to disassemble java byte code”
  Query: “Code Instruction”
  Solution Sample Code:
    Code code;
    InstructionList il = new InstructionList(code.getCode());
    Instruction[] ins = il.getInstructions();

Dev2Dev Newsgroups
• Dev2Dev Newsgroups, 2006
  Problem: “how to connect db by sesseionBean”
  Query: javax.naming.InitialContext java.sql.Connection
  Solution Sequence:
    FileName:3_AddressBean.java MethodName:getNextUniqueKey Rank:1 NumberOfOccurrences:34
    javax.naming.InitialContext,lookup(java.lang.String) ReturnType:javax.sql.DataSource
    javax.sql.DataSource,getConnection() ReturnType:java.sql.Connection

Challenges in Mining Code
• Sometimes too few data samples
– Scalability is usually not an issue
– Static code bases vs. change histories
• Data preparation/preprocessing
– Related to traditional program analysis
• Pattern postprocessing (filtering and ranking)
– Heuristics play important roles
• Demand-driven mining vs. any-gold mining
– Programming vs. bug finding
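As one concrete reading of the point that heuristics drive pattern postprocessing, PARSEWeb's clustering and ranking heuristics can be sketched as below. Several choices here are assumptions rather than the tool's documented behavior: "differing by the cluster precision value" is taken to mean that each sequence has at most that many statements missing from the other, clustering is greedy against the first member of each cluster, and frequency takes priority over length when ranking. All names are hypothetical.

```python
from typing import List, Sequence


def similar(a: Sequence[int], b: Sequence[int], precision: int = 0) -> bool:
    """Order-insensitive similarity; precision 0 means the same set of statements."""
    sa, sb = set(a), set(b)
    return max(len(sa - sb), len(sb - sa)) <= precision


def cluster(sequences: List[Sequence[int]], precision: int = 0) -> List[List[Sequence[int]]]:
    """Greedy clustering: join the first cluster whose representative is similar."""
    clusters: List[List[Sequence[int]]] = []
    for seq in sequences:
        for members in clusters:
            if similar(members[0], seq, precision):
                members.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def rank(clusters: List[List[Sequence[int]]]) -> List[List[Sequence[int]]]:
    """Higher frequency (cluster size) first; shorter representative breaks ties."""
    return sorted(clusters, key=lambda members: (-len(members), len(members[0])))


# Statement ids as on the Clustering Heuristics slide.
sequences = [[8, 9, 6, 7], [2, 3, 4, 5], [2, 4, 3, 5], [1, 2]]
ranked = rank(cluster(sequences, precision=0))
print([members[0] for members in ranked])
```

With precision 0 the two orderings of 2 3 4 5 fall into one cluster and rank first; raising the precision to 1 would additionally treat 8 9 6 7 and 8 6 10 7 as similar.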
Conclusion
• Mining various types of software engineering data to aid software engineering tasks
• Mining program source code to improve programmer productivity
– MAPO: mining API usage patterns for a given API
– Apiartor: mining API usage patterns for a given set of APIs
– PARSEWeb: mining API usage patterns for input-output-type queries

Questions?
Mining Software Engineering Data Bibliography
http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources

Demand-Driven Or Not
• Any-gold mining
– Examples: DynaMine, …
– Advantages: Surface up only cases that are applicable
– Issues: How much gold is good enough given the amount of data to be mined?
• Demand-driven mining
– Examples: MAPO, BugTriage, …
– Advantages: Exploit demands to filter out irrelevant information
– Issues: How high a percentage of cases would work well?

Code vs. Non-Code
• Code/Programming Langs
– Examples: MAPO, DynaMine, …
– Advantages: Relatively stable and consistent representation
• Non-Code/Natural Langs
– Examples: BugTriage, CVS/Code comments, emails, docs
– Advantages: Common source of capturing programmers’ intentions
• Issues: What project/context-specific heuristics to use?

Static vs. Dynamic
• Static Data: code bases, change histories
– Examples: MAPO, DynaMine, …
– Advantages: No need to set up exec environment; more scalable
– Issues: How to reduce false positives?
• Dynamic Data: prog states, structural profiles
– Examples: Spec discovery, …
– Advantages: More-precise info
– Issues: How to reduce false negatives? Where do tests come from?

Snapshot vs. Changes
• Code snapshot
– Examples: MAPO, …
– Advantages: Larger amount of available data
• Code change history
– Examples: DynaMine, …
– Advantages: Revision transactions encode more-focused entity relationships
– Issues: How to group CVS changes into transactions?