Download Mining

Document related concepts
no text concepts found
Transcript
Mining Software Data: Code
Tao Xie
University of Illinois at Urbana-Champaign
http://web.engr.illinois.edu/~taoxie/
[email protected]
Mining Software Engineering Data
MAIN GOAL
Transform static recordkeeping SE data to active
data
 Make SE data actionable
by uncovering hidden
patterns and trends

Bugzilla Mailings
Code
repository
CVS
Execution
traces
2
Mining Software Engineering Data
programming defect detection
testing
debugging
maintenance …
software engineering tasks
data mining techniques
code
bases
change
history
program
states
structural
entities
software engineering data
https://sites.google.com/site/asergrp/dmse
bug
reports/nl …
Mining Software Engineering Data
programming defect detection
testing
debugging
maintenance …
software engineering tasks
data mining techniques
code
bases
change
history
program
states
structural
entities
software engineering data
bug
reports/nl …
Motivation

Programmers commonly reuse APIs of existing
frameworks or libraries
–
–
–
Advantages: High productivity of development
Challenges: Complexity and lack of documentation
Consequences:
•
•
–
Spend more efforts in understanding APIs
Introduce defects in API client code
Solution: Mining API properties as common patterns
across API client code
Frame
works
5
5
Solution-Driven Problem-Driven
Where can I apply X miner?
Basic
mining
algorithms
E.g., association rule mining,
frequent itemset mining…
Advanced
mining
algorithms
E.g., frequent partial order
mining [ESEC/FSE 07]
What patterns do
we really need?
New/adapted
mining
algorithms
E.g., [ICSE 09], [ASE 09]
6
Mining  Searching + Mining
Traditional approaches
Code repositories
1
patterns
mining
2
Eclipse, Linux, …
Often lack sufficient relevant data points (Eg. API call sites)
Our new approaches
Code repositories
1
2
…
Open source code
on the web
N
searching
mining
patterns
Code search engine
e.g.,
7
7
7
Agenda
Motivation
 Mining Sequence Association Rules

(CAR-Miner) [ICSE 09]
 Detecting Exception-Handling Defects

Mining Alternative Patterns
(Alattin) [ASE 09]
 Detecting Neglected Condition Defects

Conclusion
8
Exception Handling

APIs throw exceptions during runtime errors
Example: Session API of Hibernate framework throws
HibernateException

APIs expect client applications to implement
recovery actions after exceptions occur
Example: Hibernate Session API expects client application to
rollback open uncommitted transactions after
HibernateException occurs

Failure to handle exceptions results in
Fatal issues, e.g., database lock won’t be released if the
transaction is not rolled back
9
Problem Addressed by CAR-Miner

Use exception-handling specification to detect
violations as defects
Problem: Often specifications are not documented
 Solution: Mine specifications from existing API client code
 Challenges:



Limited data points: Only from a few code bases 
searching + mining
Limited expressiveness: Not sufficient to characterize
common exception-handling behaviors: why?
10
Example
Scenario 1
1.1: ...
1.2: OracleDataSource ods = null; Session session = null;
Connection conn = null; Statement statement = null;
1.3: logger.debug("Starting update");
1.4: try {
1.5:
ods = new OracleDataSource();
1.6:
ods.setURL("jdbc:oracle:thin:scott/[email protected]:1521:catfish");
1.7:
conn = ods.getConnection();
1.8:
statement = conn.createStatement();
1.9:
statement.executeUpdate("DELETE FROM table1");
1.10: connection.commit(); }
1.11: catch (SQLException se) {
1.13:
logger.error("Exception occurred"); }
1.14: finally {
1.15: if(statement != null) { statement.close(); }
1.16: if(conn != null) { conn.close(); }
1.17: if(ods != null) { ods.close(); } }
1.18: }



Defect: No rollback done
when SQLException occurs
Requires specification such
as “Connection should be
rolled back when a
connection is created and
SQLException occurs”
Q: Should every connection
instance has to be rolled
back when SQLException
occurs?
Missing “conn.rollback()”
11
Example (cont.)
Scenario 1
1.1: ...
1.2: OracleDataSource ods = null; Session session = null;
Connection conn = null; Statement statement = null;
1.3: logger.debug("Starting update");
1.4: try {
1.5:
ods = new OracleDataSource();
1.6:
ods.setURL("jdbc:oracle:thin:scott/[email protected]:1521:catfish");
1.7:
conn = ods.getConnection();
1.8:
c statement = conn.createStatement();
1.9:
statement.executeUpdate("DELETE FROM table1");
1.10: connection.commit(); }
1.11: catch (SQLException se) {
1.12:
if (conn != null) { conn.rollback(); }
1.13:
logger.error("Exception occurred"); }
1.14: finally {
1.15: if(statement != null) { statement.close(); }
1.16: if(conn != null) { conn.close(); }
1.17: if(ods != null) { ods.close(); } }
1.18: }
Scenario 2
2.1: Connection conn = null;
2.2: Statement stmt = null;
2.3: BufferedWriter bw = null; FileWriter fw = null;
2.3: try {
2.4:
fw = new FileWriter("output.txt");
2.5:
bw = BufferedWriter(fw);
2.6:
conn = DriverManager.getConnection("jdbc:pl:db", "ps", "ps");
2.7:
Statement stmt = conn.createStatement();
2.8:
ResultSet res = stmt.executeQuery("SELECT Path FROM Files");
2.9:
while (res.next()) {
2.10:
bw.write(res.getString(1));
2.11: }
2.12: res.close();
2.13: } catch(IOException ex) { logger.error("IOException occurred");
2.14: } finally {
2.15: if(stmt != null) stmt.close();
2.16: if(conn != null) conn.close();
2.17: if (bw != null) bw.close();
2.18: }
Specification: “Connection creation => Connection rollback”

Satisfied by Scenario 1 but not by Scenario 2

But Scenario 2 has no defect
12
Example (cont.)


Simple association rules of the form “FCa => FCe” are
not expressive
Requires more general association rules (sequence
association rules) such as
(FCc1 FCc2) Λ FCa => FCe1, where
FCc1 -> Connection conn = OracleDataSource.getConnection()
FCc2 -> Statement stmt = Connection.createStatement()
FCa -> stmt.executeUpdate()
FCe1 -> conn.rollback()
13
Example (cont.)


Simple association rules of the form “FCa => FCe” are
not expressive
Requires more general association rules (sequence
association rules) such as
(FCc1 FCc2) Λ FCa => FCe1, where
FCc1 -> Connection conn = OracleDataSource.getConnection()
FCc2 -> Statement stmt = Connection.createStatement()
FCa -> stmt.executeUpdate() //Triggering Action
FCe1 -> conn.rollback()
14
Example (cont.)


Simple association rules of the form “FCa => FCe” are
not expressive
Requires more general association rules (sequence
association rules) such as
(FCc1 FCc2) Λ FCa => FCe1, where
FCc1 -> Connection conn = OracleDataSource.getConnection()
FCc2 -> Statement stmt = Connection.createStatement()
FCa -> stmt.executeUpdate()
FCe1 -> conn.rollback() //Recovery Action
15
Example (cont.)


Simple association rules of the form “FCa => FCe” are
not expressive
Requires more general association rules (sequence
association rules) such as
(FCc1 FCc2) Λ FCa => FCe1, where
FCc1 -> Connection conn = OracleDataSource.getConnection()
FCc2 -> Statement stmt = conn.createStatement() //Context
FCa -> stmt.executeUpdate()
FCe1 -> conn.rollback()
16
CAR-Miner Approach
Issue queries and collect relevant code
examples. Eg: “lang:java
java.sql.Statement executeUpdate”
Construct exceptionflow graphs
Open Source Projects on web
1
2
…
…
N
Exception-Flow
Graphs
Static Traces
Classes and
Functions
Extract classes
and functions
reused
Input
Application
Check whether there are
any exception-related
defects
Collect static traces
Mine static traces
Violations
Sequence
Association
Rules
Detect violations
17
CAR-Miner Approach
Open Source Projects on web
1
2
…
…
N
Exception-Flow
Graphs
Static Traces
Classes and
Functions
Input
Application
Violations
Sequence
Association
Rules
18
Exception-Flow-Graph Construction

Based on a previous algorithm [Sinha&Harrold TSE 00]
: normal execution path
----: exceptional execution path
Exception-Flow-Graph Construction

Prevent infeasible edges using a sound static analysis
[Robillard&Murphy FSE 99]
20
CAR-Miner Approach
Open Source Projects on web
1
2
…
…
N
Exception-Flow
Graphs
Static Traces
Classes and
Methods
Input
Application
Violations
Sequence
Association
Rules
21
Static Trace Generation


Collect static traces with the actions
taken when exceptions occur
A static trace for Node 7:
“4 -> 5 -> 6 -> 7 -> 15 -> 16 -> 17”
22
Static Trace Generation

Includes 3 sections:




Normal functioncall sequence (4
-> 5 -> 6)
Function call (7)
Exception
function-call
sequence (15 ->
16 -> 17)
A static trace for Node 7: “4 -> 5 -> 6 -> 7 -> 15 -> 16 -> 17”
23
Trace Post-Processing


Identify and remove unrelated function
calls using data dependency
“4 -> 5 -> 6 -> 7 -> 15 -> 16 -> 17”
4: FileWriter fw = new FileWriter(“output.txt”)
5: BufferedWriter bw = new BufferedWriter(fw)
...
7: Statement stmt = conn.createStatement()
...

Filtered sequence “6 -> 7 -> 15 -> 16“
24
CAR-Miner Approach
Open Source Projects on web
1
2
…
…
N
Exception-Flow
Graphs
Static Traces
Classes and
Methods
Input
Application
Violations
Sequence
Association
Rules
25
Static Trace Mining


Handle traces of each function call (triggering
function call) individually
Input: Two sequence databases with a one-to-one
mapping
•
•

normal function-call sequences (context)
exception function-call sequences (recovery)
Objective: Generate sequence association rules of the
form
(FCc1 ... FCcn) Λ FCa => FCe1 ... FCen
Context
Trigger
Recovery
26
Mining Problem Definition

Input: Two sequence databases with a one-to-one mapping
Context

Recovery
Objective: To get association rules of the form
FC1 FC2 ... FCm -> FE1 FE2 ... FEn
where {FC1, FC2, ..., Fcm} Є SDB1 and {FE1, FE2, ..., Fen} Є SDB2

Existing association rule mining algorithms cannot be directly
applied on multiple sequence databases
27
Mining Problem Solution



Annotate the sequences to generate a single combined database
Apply frequent subsequence mining algorithm [Wang and Han, ICDE 04] to
get frequent sequences
Transform mined sequences into sequence association rules
(3 10) Λ FCa => (2 8)
Context Trigger

Recovery
Rank rules based on the support assigned by frequent
subsequence mining algorithm
28
CAR-Miner Approach
Open Source Projects on web
1
2
…
…
N
Exception-Flow
Graphs
Static Traces
Classes and
Methods
Input
Application
Violations
Sequence
Association
Rules
29
Violation Detection

Analyze each call site of triggering call FCa
 Step 1: Extract context call sequence “CC1
CC2 ... CCm” from the beginning of the
function to the call site of FCa
 Step 2: If CC1 CC2 ... CCm is super-sequence
of FCc1 ... FCcn

Report any missing function calls of {FCe1 ... FCen} in
any exception path
API client: (CC1 CC2 ... CCm) Λ FCa => Missing any?
isSuperSeqOf
API Rule: (FCc1 ... FCcn)
Context
Λ FCa => FCe1 ... FCen
Trigger
Recovery
30
Evaluation
Research Questions:
Do the mined rules represent real rules?
2. Do the detected violations represent real
defects?
3. Does CAR-Miner perform better than WNminer [Weimer and Necula, TACAS 05]?
4. Do the sequence association rules help
detect new defects?
1.
31
Subjects



Internal Info: classes and methods belonging to the app
External Info: classes and methods used by the app
Code examples: #files collected through code search engine
32
RQ1: Real Rules

Do the mined rules represent real rules?
Real rules: 55% (Total: 294)
Usage patterns: 3%
False positives: 43%
33
RQ1: Distribution of Real Rules for Axion


Distribution of rules based on ranks assigned by CAR-Miner
#false positives is quite low between 1 to 60 rules
34
RQ2: Detected Violations



Do the detected violations represent real defects?
Total number of defects: 160
New defects not found by WN-Miner approach: 87
35
RQ2: Status of Detected Violations


HsqlDB developers responded on the first 10 reported
defects
 Accepted 7 defects
 Rejected 3 defects
Reason given by HsqlDB developers for rejected defects:
“Although it can throw exceptions in general, it should not throw with
HsqlDB, So it is fine”
36
RQ3: Comparison with WN-miner




Does CAR-Miner performs better than WN-miner?
Found 224 new rules and missed 32 rules
CAR-Miner detected most of the rules mined by WN-miner
Two major factors:
 sequence association rules
 Increase in the data scope
37
RQ4: New defects by sequence association rules


Do the sequence association rules detect new defects?
Detected 21 new real defects among all applications
38
Agenda
Motivation
 Mining Sequence Association Rules

(CAR-Miner) [ICSE 09]
 Detecting Exception-Handling Defects

Mining Alternative Patterns
(Alattin) [ASE 09]
 Detecting Neglected Condition Defects

Conclusion
39
Large Number of False Positives


Existing approaches produce a large number of false
positives
One major observation:
 Programmers often write code in different ways for
achieving the same task
 Some ways are more frequent than others
Frequent
ways
mine patterns
Infrequent
ways
detect violations
Mined Patterns
40
40
Example: java.util.Iterator.next()
Java.util.Iterator.next() throws NoSuchElementException when invoked on a list
without any elements Code Example 2
Code Sample 1
PrintEntries1(ArrayList<string>
entries)
{
…
Iterator it = entries.iterator();
if(it.hasNext()) {
string last = (string) it.next();
}
…
}
Code Sample 2
PrintEntries2(ArrayList<string>
entries)
{
…
if(entries.size() > 0) {
Iterator it = entries.iterator();
string last = (string) it.next();
}
…
}
41
Example: java.util.Iterator.next()
Code Sample 1
PrintEntries1(ArrayList<string>
entries)
{
…
Iterator it = entries.iterator();
if(it.hasNext()) {
string last = (string) it.next();
}
…
}
Code Sample 2
PrintEntries2(ArrayList<string>
entries)
{
…
if(entries.size() > 0) {
Iterator it = entries.iterator();
string last = (string) it.next();
}
…
}
Sample 2 (6/1243)
Sample 1 (1218 / 1243)
1243 code examples
Mined Pattern from existing approaches:
“boolean check on return of Iterator.hasNext before Iterator.next”
42
Example: java.util.Iterator.next()
Code Sample 1
PrintEntries1(ArrayList<string>
entries)
{
…
Iterator it = entries.iterator();
if(it.hasNext()) {
string last = (string) it.next();
}
…
}

Code Sample 2
PrintEntries2(ArrayList<string>
entries)
{
…
if(entries.size() > 0) {
Iterator it = entries.iterator();
string last = (string) it.next();
}
…
}
Require more general patterns (alternative patterns): P1 or P2
P1 : boolean check on return of Iterator.hasNext before Iterator.next
P2 : boolean check on return of ArrayList.size before Iterator.next
Cannot be mined by existing approaches, since alternative P2
is infrequent

43
Our Solution: ImMiner Algorithm


Mines alternative patterns of the form P1 or P2
Based on the observation that infrequent alternatives such as P2 are
frequent among code examples that do not support P1
1243 code examples
Sample 2 (6/1243)
Sample 1 (1218 / 1243)
P2 is infrequent among entire
1243 code examples
P2 is frequent among code
examples not supporting P1
44
Alternative Patterns

ImMiner mines three kinds of alternative
patterns of the general form “P1 or P2”
Balanced: all alternatives (both P1 and P2) are frequent
Imbalanced: some alternatives (P1) are frequent and
others are infrequent (P2). Represented as “P1 or P^2”
Single: only one alternative
45
ImMiner Algorithm
Uses frequent-itemset mining [Burdick et al. ICDE 01]
iteratively
 An input database with the following APIs
for Iterator.next()

Input database
Mapping of IDs to APIs
46
ImMiner Algorithm: Frequent Alternatives
Input database
Frequent itemset
mining
(min_sup 0.5)
Frequent item: 1
P1: boolean-check on the return of
Iterator.hasNext() before Iterator.next()
47
ImMiner: Infrequent Alternatives of P1

Split input database into two databases: Positive and Negative
Positive database (PSD)
Negative database (NSD)
Mine patterns that are frequent in NSD and are infrequent in PSD
Reason: Only such patterns serve as alternatives for P1
 Alternative Pattern : P “const check on the return of ArrayList.size()
2
before Iterator.next()”
 Alattin applies ImMiner algorithm to detect neglected conditions


48
Neglected Conditions

Neglected conditions refer to
Missing conditions that check the arguments or
receiver of the API call before the API call
 Missing conditions that check the return or
receiver of the API call after the API call


One of the primary reasons for many fatal
issues

security or buffer-overflow vulnerabilities [Chang et
al. ISSTA 07]
49
Alattin Approach
Phase 1: Issue queries and collect
relevant code samples. Eg: “lang:java
java.util.Iterator next”
Phase 2: Generate
pattern candidates
Open Source Projects on web
1
2
…
N
Pattern
Candidates
…
Classes and
methods
Extract classes
and methods
reused
Application
Under Analysis
Phase 3: Mine
alternative patterns
Violations
Detect neglected
conditions
Alternative
Patterns
Phase 4: Detect neglected
conditions statically
50
Evaluation

1.
2.
Research Questions:
Do alternative patterns exist in real
applications?
How high percentage of false positives are
reduced (with low or no increase of false
negatives) in detected violations?
51
Subjects
#Samples: #code examples collected from Google code search

Two categories of subjects:
 3 Java default API libraries
 3 popular open source libraries
52
RQ1: Balanced and Imbalanced Patterns

How high percentage of balanced and imbalanced patterns exist in real apps?



Balanced patterns: 0% to 30% (average: 9.69%)
Imbalanced patterns:

30% to 100% (average: 65%) for Java default API libraries

0% to 9.5% (average: 5%) for open source libraries
Explanation: Java default API libraries provide more different ways of
writing code compared to open source libraries
53
RQ2: False Positives and False Negatives


How high % of false positives are reduced (with low or no increase of
false negatives)?
Applied mined patterns (“P1 or P2 or ... or Pi or A^1 or A^2 or ... or A^j ”)
in three modes:

Existing mode:
“P1 or P2 or ... or Pi or A^1 or A^2 or ... or A^j ”
 P1 ,P2, ... , Pi

Balanced mode:
“P1 or P2 or ... or Pi or A^1 or A^2 or ... or A^j ”
 “P1 or P2 or ... or Pi”

Imbalanced mode:
“P1 or P2 or ... or Pi or A^1 or A^2 or ... or A^j ”
 “P1 or P2 or ... or Pi or A^1 or A^2 or ... or A^j ”54
RQ2: False Positives and False Negatives

Existing Mode vs Balanced Mode
Application
Existing Mode
Balanced Mode
Defects
False
Positives
Defects
False
Positives
% of
reduction
False
Negatives
Java Util
37
104
37
104
0
0
Java
Transaction
51
105
51
105
0
0
Java SQL
56
143
56
90
37.06
0
BCEL
2
14
2
8
42.86
0
HSqlDB
1
0
1
0
0
0
Hibernate
10
9
10
8
11.11
0
15.17
0
AVERAGE/
TOTAL

Balanced mode reduced false positives by 15.17% without
any increase in false negatives
55
RQ2: False Positives and False Negatives
Existing Mode vs Imbalanced Mode

Application
Existing Mode
Imbalanced Mode
Defects
False
Positives
Defects
False
Positives
% of
reduction
False
Negatives
Java Util
37
104
36
74
28.85
1
Java
Transaction
51
105
47
76
27.62
4
Java SQL
56
143
53
81
43.36
3
BCEL
2
14
2
6
57.04
0
HSqlDB
1
0
1
0
0
0
Hibernate
10
9
10
8
11.11
0
28.01
8
AVERAGE/
TOTAL

Imbalanced mode reduced false positives by 28% with quite
small increase in false negatives
56
Conclusion


Problem-driven methodology by identifying
 new problems, patterns
 mining algorithms, defects
CAR-Miner [ICSE 09]: mining sequence association
rules of the form
(FCc1 ... FCcn) Λ FCa => (FCe1 ... Fcen)
Context

Trigger
Recovery
 reduce false negatives
Alattin [ASE 09]: mining alternative patterns classified
into three categories: balanced, imbalanced, and single
P1 or P2 or ... or Pi or A^1 or A^2 or ... or A^j
 reduce false positives
57
Thank You
Questions?
Tao Xie
University of Illinois at Urbana-Champaign
http://web.engr.illinois.edu/~taoxie/
[email protected]
58
Related documents