IBM Haifa Research Lab
Concurrent Bugs
How to Eliminate Them or Render Them Harmless
Shmuel Ur
© 2008 IBM Corporation
Outline
▪ Why concurrent testing is so hard
– Some bug samples
▪ Do programmers know how to write concurrent programs?
– What you don’t know will hurt you!
▪ Testing – finding bugs or making them appear
▪ Debugging – one of the major problems of the field
▪ Healing – can you make the problem disappear or be a lot less likely?
Why is Concurrent Testing Hard?
▪ Concurrency introduces non-determinism
– Multiple executions of the same test may have different interleavings (and different results!)
• An interleaving is the relative execution order of the program threads
– Very hard to reproduce and debug
▪ No useful coverage measures for the interleaving space
▪ Concurrent bugs typically appear only in specific configurations
– Therefore commonly found by users
– Require large configurations to test
▪ They represent only ~10% of the bugs, but a disproportionate number are found late or by the customer -> very expensive
The costly effort of testing concurrency at the system level is seemingly unavoidable
Classic Java Counter Example
Consider the function increase(), which is part of a class that acts as a counter:
public void increase() {
    counter++;
}
Although written as a single “increase” operation, the “++” operator is actually mapped into three steps at the JVM level [load operand, increment, write-back]
Counter Example, continued
Thread A              Thread B
…                     …
load op
   (context switch)
                      load op
                      increment
                      write-back
increment
write-back
…                     …
counter = 4
3
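The lost update in this interleaving can be tried out with a small Java sketch. The class LostUpdateSketch and its methods are illustrative names, not part of the talk. The plain field may or may not lose updates on a given run, which is the non-determinism described earlier; AtomicInteger is used here as one standard fix (a synchronized method would be another).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; all names are illustrative, not from the slides.
final class LostUpdateSketch {
    private int unsafeCounter;                                      // ++ is load, increment, write-back
    private final AtomicInteger safeCounter = new AtomicInteger(); // atomic read-modify-write

    void increase() {
        unsafeCounter++;               // racy: another thread may write between load and write-back
        safeCounter.incrementAndGet(); // atomic: no update can be lost
    }

    /** Runs `threads` threads, each calling increase() `perThread` times; returns {unsafe, safe}. */
    static int[] run(int threads, int perThread) {
        LostUpdateSketch counter = new LostUpdateSketch();
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) counter.increase();
            });
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join(); // join also makes the final counter values visible to this thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] { counter.unsafeCounter, counter.safeCounter.get() };
    }

    public static void main(String[] args) {
        int[] result = run(2, 100_000);
        System.out.println("unsafe=" + result[0] + " safe=" + result[1] + " expected=200000");
    }
}
```

The unsafe value can only be equal to or lower than the expected total; how often it is lower depends on the machine and the scheduler.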
Simple Protocol
[Figure: a tunnel with signalman B at one end and signalman A at the other]
The train enters the tunnel.
The semaphore automatically turns the light to red.
Signalman A sends a message to signalman B that a train is in the tunnel.
The train exits the tunnel.
Signalman B sends a message to signalman A, that the train has exited the tunnel.
Signalman A sets the light to green.
Train Stopping on Red
The first train enters the tunnel.
The semaphore automatically turns the light to red.
Signalman A sends a message to signalman B, that a train is in the tunnel.
The second train approaches the tunnel and stops at the red light.
The first train exits the tunnel.
B sends a message to A, that the first train has exited the tunnel.
A sets the light to green.
The second train enters the tunnel.
Malfunctioning Semaphore
The first train enters the tunnel – the semaphore fails to turn the light red!
Signalman A sends a message to signalman B, that a train is in the tunnel.
The second train approaches the tunnel.
Signalman A runs to the track, signals the train to stop, and sets the light to red manually.
The first train exits the tunnel.
B sends a message to A, that the first train has exited the tunnel.
A sets the signal to green and lets the second train pass.
The second train enters and leaves the tunnel.
Clayton Tunnel Accident - 25 August 1861
The first train enters the tunnel – the semaphore fails to turn the light red!
Signalman A sends a message to signalman B, that a train is in the tunnel.
Signalman A runs to the track to set the light to red and signal other trains to stop.
But a second train arrives and enters the tunnel before he has time to change the light to red. The driver sees something but doesn’t have time to stop.
Signalman A sends another message to signalman B, that a train is in the tunnel.
The driver of the second train is suspicious that something is wrong and stops in the middle of the tunnel.
The third train approaches the tunnel and stops.
The first train exits the tunnel.
Signalman B sends a message to A, that the train has exited the tunnel.
Signalman A interprets the message as meaning that both trains have exited the tunnel, and he sets the light to green.
The third train enters the tunnel.
The second train driver decides to back out of the tunnel.
The second and third trains collide.
Common Situations May Have Concurrent Aspects
© 2007 SHADOWS EU Framework Program
Some Problems Are Ubiquitous
When the Problem is Common, the Solution is Part of the Design
Bug Found by ConTest in WebSphere Site Analyzer
▪ Crawler, a ~1000 line Java component, is part of the WebSphere Site Analyzer, used to perform content analysis of web sites
▪ Bug description
    if (connection != null) connection.setStopFlag();
– connection is checked to be non-null
– The CPU is lost
– connection is set to null before the CPU is regained
– If this happens before connection.setStopFlag(); executes, an exception is thrown
▪ This bug was found while still testing ConTest
▪ This bug should (also) have been found in unit testing...
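A minimal sketch of this check-then-act race and one common way to avoid it, assuming a hypothetical CrawlerSketch class (the real Crawler code is not shown in the slides). The Runnable parameter is an illustration device: it stands in for "the CPU is lost" so that the bad interleaving can be forced deterministically. The safer variant reads the field once into a local variable.

```java
// Hypothetical stand-in for the Crawler component; names are illustrative.
final class CrawlerSketch {
    static final class Connection {
        private volatile boolean stopFlag;
        void setStopFlag() { stopFlag = true; }
        boolean isStopped() { return stopFlag; }
    }

    private volatile Connection connection = new Connection();

    /** In real code this would be called from another thread. */
    void disconnect() { connection = null; }

    /** Buggy pattern from the slide: the field is read twice. */
    void stopUnsafe(Runnable betweenCheckAndUse) {
        if (connection != null) {
            betweenCheckAndUse.run();   // stands in for losing the CPU here
            connection.setStopFlag();   // NullPointerException if the field was nulled meanwhile
        }
    }

    /** One possible fix: read the field once and use the local copy. */
    void stopSafe(Runnable betweenCheckAndUse) {
        Connection local = connection;
        if (local != null) {
            betweenCheckAndUse.run();   // the field may change, the local copy cannot
            local.setStopFlag();
        }
    }

    public static void main(String[] args) {
        CrawlerSketch unsafe = new CrawlerSketch();
        try {
            unsafe.stopUnsafe(unsafe::disconnect);
            System.out.println("unsafe variant: no exception");
        } catch (NullPointerException e) {
            System.out.println("unsafe variant: NullPointerException");
        }
        CrawlerSketch safe = new CrawlerSketch();
        safe.stopSafe(safe::disconnect);
        System.out.println("safe variant: completed");
    }
}
```

A unit test that forces the switch at this point, as the hook does here, is the kind of test the last bullet asks for.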
Initialization-Sleep
▪ One example is adding sleep() statements to ensure that only the correct interleavings occur
– Partial, non-consistent results are used by the thread that assumes that initialization is done
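As a contrast to sleep()-based ordering, the following sketch waits for an explicit "initialization done" signal using java.util.concurrent.CountDownLatch. The class InitLatchSketch and its table contents are made up for illustration; the talk does not prescribe this fix.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical example: readers block until initialization has really finished,
// instead of sleeping and hoping it has.
final class InitLatchSketch {
    private final CountDownLatch initialized = new CountDownLatch(1);
    private int[] table; // written before countDown(), read only after await()

    void initialize() {
        int[] t = new int[1000];
        for (int i = 0; i < t.length; i++) t[i] = i * 2;
        table = t;
        initialized.countDown(); // publishes the fully built table
    }

    int lookup(int index) {
        try {
            initialized.await(); // returns only after initialize() has completed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return table[index];
    }

    /** Starts initialization in another thread and reads immediately. */
    static int demo() {
        InitLatchSketch sketch = new InitLatchSketch();
        new Thread(sketch::initialize).start();
        return sketch.lookup(21);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```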
Lost-Notify
▪ Losing notify: the notify is “lost” because it occurs before the thread executes the wait() primitive
– The gap was created because the programmer didn’t think the notify would occur before the wait
Thread 1:
synchronized (o) {
    o.notifyAll();
}
Thread 2:
synchronized (o) {
    o.wait();
}
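One common way to make the notification impossible to lose is to record it in a flag guarded by the same lock, so that a waiter arriving late sees the flag and does not wait. The sketch below assumes a hypothetical GuardedSignalSketch class; the timeout parameter is added only so the demo cannot hang.

```java
// Hypothetical example: the notify is remembered in a flag, so notify-before-wait is harmless.
final class GuardedSignalSketch {
    private final Object o = new Object();
    private boolean signalled; // guarded by o

    void signal() {
        synchronized (o) {
            signalled = true; // remember the event
            o.notifyAll();
        }
    }

    /** Returns true if the signal was (or had already been) given, false on timeout or interrupt. */
    boolean awaitSignal(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (o) {
            while (!signalled) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) return false;
                try {
                    o.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        GuardedSignalSketch early = new GuardedSignalSketch();
        early.signal();                                   // notify happens first
        System.out.println(early.awaitSignal(1000));      // true: the flag kept the event
        System.out.println(new GuardedSignalSketch().awaitSignal(50)); // false: never signalled
    }
}
```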
Orphaned-Thread
▪ The tale of the orphaned thread
– A single master thread drives actions of other threads
– Messages are put on the queue by the master thread and processed by the worker threads
– Abnormal termination of the master thread results in the remaining threads being orphaned
– The system often blocks
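One way to keep workers from being orphaned is to have the master always enqueue a shutdown marker (a "poison pill") in a finally block, so the workers stop even if the master dies. This is a sketch under that assumption, not a technique taken from the talk; PoisonPillSketch and the message names are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical master/worker setup with a shutdown marker sent from a finally block.
final class PoisonPillSketch {
    private static final String POISON = new String("POISON"); // unique instance, compared by identity

    /** Returns true if all workers terminated even though the master failed. */
    static boolean workersTerminateAfterMasterFailure(int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    while (queue.take() != POISON) {
                        // process the message (omitted)
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }
        Thread master = new Thread(() -> {
            try {
                queue.add("job-1");
                throw new IllegalStateException("simulated abnormal termination");
            } finally {
                for (int i = 0; i < workers; i++) queue.add(POISON); // one marker per worker
            }
        });
        master.setUncaughtExceptionHandler((t, e) -> { /* failure is expected in this demo */ });
        master.start();
        boolean allDone = true;
        try {
            master.join();
            for (Thread w : pool) {
                w.join(5000);
                allDone &= !w.isAlive();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return allDone;
    }

    public static void main(String[] args) {
        System.out.println("workers terminated: " + workersTerminateAfterMasterFailure(3));
    }
}
```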
Unintentional-Different-Thread
▪ A call to an API (typically a GUI API) is assumed to be in the same thread
… but is actually in a different thread
… causing the order of locking to sometimes change
… resulting in a deadlock
– Nasty when you have a deadlock in a program you believe is single-threaded
Condition-For-Wait
▪ Missing condition enclosing the wait
– When returning from a wait the programmer forgets to check, or checks incorrectly, if the reason for which he waited still holds
– In Java 1.5 a wait can decide to terminate. Is your code ready?
“A thread can also wake up without being notified, interrupted, or timing out, a so-called spurious wakeup. While this will rarely occur in practice, applications must guard against it by testing for the condition that should have caused the thread to be awakened, and continuing to wait if the condition is not satisfied. In other words, waits should always occur in loops, like this one:
synchronized (obj) {
    while (<condition does not hold>)
        obj.wait(timeout);
    ... // Perform action appropriate to condition
}”
From standard: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Object.html#wait(long)
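A concrete instance of the wait-in-a-loop idiom is a one-slot mailbox, sketched below. OneSlotMailbox is a hypothetical class written for this illustration; both put() and take() re-check their condition after every wakeup, so a spurious wakeup or a notification meant for the other side cannot cause a wrong action.

```java
// Hypothetical one-slot mailbox; every wait() sits inside a loop on its condition.
final class OneSlotMailbox {
    private Integer item; // guarded by this; null means "empty"

    synchronized void put(int value) throws InterruptedException {
        while (item != null) wait();   // condition: slot is empty
        item = value;
        notifyAll();
    }

    synchronized int take() throws InterruptedException {
        while (item == null) wait();   // condition: slot is full
        int value = item;
        item = null;
        notifyAll();
        return value;
    }

    /** A producer thread puts 1..n, the caller takes n values and returns their sum. */
    static int demoSum(int n) {
        OneSlotMailbox box = new OneSlotMailbox();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) box.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < n; i++) sum += box.take();
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(demoSum(5)); // prints 15
    }
}
```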
Riddle I: How Many Bugs Do You See?
public class Helloworld{
    public static void main(String[] argv){
        Hello helloThread = new Hello();
        World worldThread = new World();
        helloThread.start();
        worldThread.start();
    }
}
class Hello extends Thread{
    public void run(){
        System.out.print("hello ");
    }
}
class World extends Thread{
    public void run(){
        try{Thread.sleep(1000);}
        catch(Exception exc){};
        System.out.print("world");
        System.out.print("\n");
    }
}
Riddle II: Understanding Synchronization Primitives
▪ Locks protect the code segment and not the shared data
– Obtaining a lock when accessing the shared resources
▪ On an error path (e.g., an exception), does the system release the lock?
▪ Consider the following class:
class Conflict {
    Conflict(…){ synchronized(Conflict.class){…}; };
    void h(…){ synchronized(this){….};};
    synchronized void g(…){….};
    void r(…){…};
    synchronized static void f(…){….};
};
▪ Can f || g, f || h, f || r, g || h, g || r, h || r cause a conflict?
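One way to explore the riddle is to ask the JVM which monitor each kind of method actually holds. The sketch below uses Thread.holdsLock on a hypothetical class, LockIdentitySketch, that mirrors the shape of Conflict with made-up bodies; it also checks the error-path question for a synchronized block.

```java
// Hypothetical class mirroring the shape of Conflict; bodies only report which monitors are held.
final class LockIdentitySketch {
    /** static synchronized: holds the monitor of the class object. Returns {holdsClass, holdsInstance}. */
    static synchronized boolean[] f(LockIdentitySketch instance) {
        return new boolean[] { Thread.holdsLock(LockIdentitySketch.class), Thread.holdsLock(instance) };
    }

    /** synchronized instance method: holds the monitor of `this`. Returns {holdsClass, holdsInstance}. */
    synchronized boolean[] g() {
        return new boolean[] { Thread.holdsLock(LockIdentitySketch.class), Thread.holdsLock(this) };
    }

    /** No synchronization at all. Returns {holdsClass, holdsInstance}. */
    boolean[] r() {
        return new boolean[] { Thread.holdsLock(LockIdentitySketch.class), Thread.holdsLock(this) };
    }

    /** Error path: is the monitor still held after an exception leaves a synchronized block? */
    static boolean lockReleasedOnException() {
        Object lock = new Object();
        try {
            synchronized (lock) {
                throw new IllegalStateException("error path");
            }
        } catch (IllegalStateException e) {
            return !Thread.holdsLock(lock);
        }
    }

    public static void main(String[] args) {
        LockIdentitySketch s = new LockIdentitySketch();
        System.out.println("f: " + java.util.Arrays.toString(f(s)));
        System.out.println("g: " + java.util.Arrays.toString(s.g()));
        System.out.println("r: " + java.util.Arrays.toString(s.r()));
        System.out.println("released on exception: " + lockReleasedOnException());
    }
}
```

Two methods can only exclude each other if they hold the same monitor, so the output gives a starting point for the pairs listed in the riddle.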
Bug Published by the ConTest Project Team
Some Interesting Observations
▪ In practice, thread switches are few and far between
– The probability of finding the previous bug is low
– Synchronization usually operates as a no-op
• Removing all synchronizations does not usually impact testing results!
• Not knowing the synchronization primitives’ exact definition does not impact testing, but the program is incorrect
– Exception in synchronization: Do you still have the lock?
– What is the synchronization on?
▪ Thread scheduling for small applications is almost deterministic in a simple environment
– Each environment has its own interleaving
• A customer finds bugs on the first day in well-tested applications!
How Does a Noise Maker Work?
▪ Find places that are “concurrently interesting”
– Places whose relative interleaving may change the result of the program: access to shared variables, synch primitives
– Possibly use bug patterns to identify such places
• Lost notify
• Sleep()
• Condition before synchronization
▪ Modify the program by adding interleaving-changing mechanisms
– Usually sleep(), yield() but more advanced mechanisms exist
• Make sure no new bug is introduced
– Make sure that the interleaving change is random
Download a noise maker: http://www.alphaworks.ibm.com/tech/contest
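The idea can be sketched by hand, without any tool. The class below is a hypothetical, manually inserted noise maker: ConTest itself adds such points by instrumentation, and none of its API is shown or assumed here. A seeded java.util.Random decides at each "interesting" point whether to yield, sleep briefly, or do nothing.

```java
import java.util.Random;

// Hypothetical hand-written noise maker; not ConTest's implementation or API.
final class NoiseMakerSketch {
    private final Random random;      // java.util.Random is safe to share between threads
    private final double probability; // chance of injecting noise at each point

    NoiseMakerSketch(long seed, double probability) {
        this.random = new Random(seed);
        this.probability = probability;
    }

    /** Call at a "concurrently interesting" point; returns true if noise was injected. */
    boolean noise() {
        if (random.nextDouble() >= probability) return false;
        if (random.nextBoolean()) {
            Thread.yield();
        } else {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return true;
    }

    /** Racy counter with a noise point between the load and the write-back. */
    static int racyCount(NoiseMakerSketch noiseMaker, int threads, int perThread) {
        int[] counter = {0};
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    int loaded = counter[0];   // load
                    noiseMaker.noise();        // possible forced thread switch
                    counter[0] = loaded + 1;   // increment and write-back
                }
            });
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println("no noise:   " + racyCount(new NoiseMakerSketch(1, 0.0), 4, 50));
        System.out.println("with noise: " + racyCount(new NoiseMakerSketch(1, 0.5), 4, 50));
    }
}
```

With noise enabled, the lost update usually shows up within a single short run, which is the effect a noise maker aims for; the exact numbers vary from run to run.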
With ConTest, Each Test Goes a Long Way
Measuring Concurrent Coverage
▪ ConTest also measures coverage
– The feature that people wanted most was not bug finding
▪ How do you know if the tests are any good from the point of view of concurrency?
▪ Synchronization coverage
– Make sure that every synchronization primitive did something
See “Applications of synchronization coverage”
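To illustrate what "did something" could mean for a lock, the sketch below records, per named code site, whether a thread ever actually had to wait for the lock there. This is a simplified, hypothetical model built on ReentrantLock.tryLock; it is not how ConTest measures synchronization coverage, and the site names are invented.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical coverage recorder: a site is "blocked" once some thread had to wait there.
final class SyncCoverageSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final Set<String> blocked = ConcurrentHashMap.newKeySet();

    void guarded(String site, Runnable body) {
        visited.add(site);
        if (!lock.tryLock()) {   // lock is held elsewhere: the primitive really excluded someone
            blocked.add(site);
            lock.lock();
        }
        try {
            body.run();
        } finally {
            lock.unlock();
        }
    }

    boolean wasVisited(String site) { return visited.contains(site); }
    boolean wasBlocked(String site) { return blocked.contains(site); }

    /** Forces contention: a holder thread keeps the lock until the caller is recorded as blocked. */
    static SyncCoverageSketch contendedDemo() {
        SyncCoverageSketch coverage = new SyncCoverageSketch();
        CountDownLatch held = new CountDownLatch(1);
        Thread holder = new Thread(() -> coverage.guarded("holder", () -> {
            held.countDown();                                        // lock is held from here on
            while (!coverage.wasBlocked("contended")) Thread.yield();
        }));
        holder.start();
        try {
            held.await();
            coverage.guarded("contended", () -> { });
            holder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return coverage;
    }

    public static void main(String[] args) {
        SyncCoverageSketch solo = new SyncCoverageSketch();
        solo.guarded("only", () -> { });
        System.out.println("solo blocked: " + solo.wasBlocked("only"));
        System.out.println("contended blocked: " + contendedDemo().wasBlocked("contended"));
    }
}
```

A site that is visited but never blocked across the whole test suite suggests that the tests never exercised the synchronization there.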
Unit Testing Concurrent Code with ConTest
1. Remove non-relevant code (optional)
2. Create a number of tests
3. Apply ConTest to the code
4. Run each test multiple times and measure synchronization coverage
   1. If a bug is found, use ConTest to debug, fix, go to 3
   2. If a deadlock violation is found, fix, go to 3
5. If coverage is insufficient, analyze
   1. Need to rerun tests, go to 4
   2. Need to write new tests, go to 2
6. Done
Function and System Test with ConTest
▪ Increased efficiency in finding intermittent bugs
▪ Negligible change to process in Java (in C recompilation is required)
– Installed in WebSphere-size projects in a day
– Works under any test automation used (e.g., Rational tools)
▪ Additional benefits
– Coverage measurements
– Aids in debugging
Delaying Messages with ConTest/UDP
[Figure, two panels: hosts A and B connected by a network, each with the stack Application / ConTest/UDP tool / Java UDP / IP / Link. Panel 1: Mode = Delay, Direction = Out, with messages 1 to 3. Panel 2: Mode = No-Noise, with messages 1 to 4.]
Losing Messages with ConTest/UDP
[Figure, two panels: hosts A and B connected by a network, each with the stack Application / ConTest/UDP tool / Java UDP / IP / Link. Panel 1: Mode = Block, Direction = Out, with messages 1 to 3 marked X. Panel 2: Mode = No-Noise, with message 4.]
Many Other Testing Tools/Technologies
What is Special about Concurrent Debugging?
▪ Debugging is difficult – very few good concurrent debuggers
▪ Debugging sometimes makes the bug disappear…
– Same for print statements
▪ Don’t even know the state of the program
▪ Tests fail at the customer/testers – not at the developer
– Some just happen very rarely
▪ Need to look at a lot of things
– Causes problems for reviews as well
Debugging Toolbox
▪ Lock discipline violation
▪ Orange_box – remembers the last {last_values_size} values and locations for each variable
▪ Replay
▪ Talk with ConTest (socket/keyboard) while the application is running
▪ Race detection
▪ Automatic debugging
Automatic Debugging as Feature Selection
▪ Think of each instrumentation point as a feature
▪ Execute the program many times with different subsets of the instrumentation points
– Each instrumentation point gets a score
– P(i) = P(success|Xi)/P(!success|Xi)
▪ There are false negatives and positives
▪ The nodes with the highest score may be related to the location of the failure
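The score can be computed directly from a log of runs. The sketch below assumes each run records which instrumentation points were enabled and a boolean outcome; what counts as "success" is not defined on the slide, so the outcome flag here is simply whatever the experimenter labels as success. Class and point names are hypothetical. Since P(success|Xi) + P(!success|Xi) = 1, the ratio reduces to successes divided by failures among the runs where point i was enabled.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical scoring of instrumentation points from a run log.
final class InstrumentationScoreSketch {
    static final class Run {
        final Set<String> enabledPoints;
        final boolean success; // meaning of "success" is an assumption left to the experimenter
        Run(Set<String> enabledPoints, boolean success) {
            this.enabledPoints = enabledPoints;
            this.success = success;
        }
    }

    /** Score(i) = P(success|Xi) / P(!success|Xi), estimated as successes / failures with point i enabled. */
    static Map<String, Double> scores(List<Run> runs) {
        Set<String> points = new TreeSet<>();
        for (Run run : runs) points.addAll(run.enabledPoints);
        Map<String, Double> result = new LinkedHashMap<>();
        for (String point : points) {
            int enabled = 0;
            int successes = 0;
            for (Run run : runs) {
                if (run.enabledPoints.contains(point)) {
                    enabled++;
                    if (run.success) successes++;
                }
            }
            int failures = enabled - successes;
            result.put(point, (double) successes / failures); // Infinity if the point never failed
        }
        return result;
    }

    public static void main(String[] args) {
        List<Run> runs = List.of(
            new Run(Set.of("A", "B"), true),
            new Run(Set.of("A"), true),
            new Run(Set.of("B"), false),
            new Run(Set.of("A", "C"), false));
        System.out.println(scores(runs)); // prints {A=2.0, B=1.0, C=0.0}
    }
}
```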
First Largish Program
Does Not Work When the Problem is Easy
Differences on the First Large Program
Differences on the Second Large Program
Healing Concurrent Programs
▪ In testing, we try to make programs more likely to fail
▪ In the field, we may want to make them less likely to fail
▪ General problem statement: the program sometimes fails (depending on the interleaving)
– General solution: eliminate the interleavings that lead to bugs, or make them less likely; but do not cause deadlocks
▪ We do it for races, deadlocks …
– Not quite as useful but very interesting
– Done as a project called SHADOWS in the EU (with Brno)
Classic Deadlock Situation
Thread 1        Thread 2
lock A          lock B
lock B          lock A        <- Deadlock
release B       release A
release A       release B
Lock Discipline Violation Detection
Thread 1        Thread 2
lock A          lock B
lock B          lock A
release B       release A
release A       release B
▪ Build a directed graph of lock acquisitions
[Figure: lock graph with nodes A and B]
▪ Unguarded cycles are potential deadlocks
– No common gate lock is held while the locks in the cycle are held
▪ Even when not real deadlocks, prone to become ones as the code evolves
▪ Cross-run lock discipline violation detection [Ur et al. 05]:
– Identify locks across different runs according to program location
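The lock graph can be sketched in a few lines. The class below is a simplified, hypothetical model: locks are identified by name (standing in for program location), an edge X -> Y is recorded when Y is acquired while X is held, and a depth-first search looks for a cycle. It does not model gate locks, so every cycle it reports is treated as unguarded.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical lock-order graph; gate locks are not modeled in this sketch.
final class LockGraphSketch {
    private final Map<String, Set<String>> edges = new HashMap<>();

    /** Records that a thread currently holding `held` acquires `next`. */
    void recordAcquisition(List<String> held, String next) {
        edges.computeIfAbsent(next, k -> new HashSet<>());
        for (String h : held) {
            if (!h.equals(next)) {   // ignore re-entrant acquisition of the same lock
                edges.computeIfAbsent(h, k -> new HashSet<>()).add(next);
            }
        }
    }

    /** True if the acquisition graph contains a cycle, i.e. a potential deadlock. */
    boolean hasCycle() {
        Set<String> visited = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String node : edges.keySet()) {
            if (dfs(node, visited, onPath)) return true;
        }
        return false;
    }

    private boolean dfs(String node, Set<String> visited, Set<String> onPath) {
        if (onPath.contains(node)) return true;  // back edge: cycle found
        if (!visited.add(node)) return false;    // already fully explored
        onPath.add(node);
        for (String next : edges.getOrDefault(node, Collections.emptySet())) {
            if (dfs(next, visited, onPath)) return true;
        }
        onPath.remove(node);
        return false;
    }

    public static void main(String[] args) {
        LockGraphSketch graph = new LockGraphSketch();
        graph.recordAcquisition(List.of(), "A");     // Thread 1: lock A
        graph.recordAcquisition(List.of("A"), "B");  // Thread 1: lock B while holding A
        graph.recordAcquisition(List.of(), "B");     // Thread 2: lock B
        graph.recordAcquisition(List.of("B"), "A");  // Thread 2: lock A while holding B
        System.out.println("potential deadlock: " + graph.hasCycle()); // true
    }
}
```

Because the graph is built from acquisition order only, the cycle is found even in runs where the two threads never actually overlapped.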
Our Deadlock Healing
▪ For each unguarded strongly connected component (SCC) of the lock graph, add a wrapper (gate) lock
– Taken before any lock in the SCC is taken
– Released after all locks in the SCC are released
▪ Very easy to implement
– Does not require run-time monitoring
– Could be implemented by code change
– However, some complications exist…
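The "code change" variant can be sketched as follows, assuming two locks a and b that form an SCC because two threads take them in opposite orders. The class GateLockSketch and the field names are hypothetical; this is a hand-written illustration of the gate idea, not the implementation described in the talk.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration: a gate lock wraps every use of the locks in the SCC {a, b}.
final class GateLockSketch {
    private final ReentrantLock a = new ReentrantLock();
    private final ReentrantLock b = new ReentrantLock();
    private final ReentrantLock gate = new ReentrantLock(); // wrapper lock for the SCC
    private int work;                                        // protected by the locks above

    private void lockBoth(ReentrantLock first, ReentrantLock second) {
        gate.lock();                 // taken before any lock in the SCC
        try {
            first.lock();
            try {
                second.lock();
                try {
                    work++;
                } finally {
                    second.unlock();
                }
            } finally {
                first.unlock();
            }
        } finally {
            gate.unlock();           // released after all locks in the SCC are released
        }
    }

    /** Two threads take a and b in opposite orders; with the gate they cannot deadlock. */
    static int demo(int iterations) {
        GateLockSketch s = new GateLockSketch();
        Thread t1 = new Thread(() -> { for (int i = 0; i < iterations; i++) s.lockBoth(s.a, s.b); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < iterations; i++) s.lockBoth(s.b, s.a); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return s.work;
    }

    public static void main(String[] args) {
        System.out.println(demo(10_000)); // prints 20000
    }
}
```

The price of the gate is reduced parallelism: all code paths that touch the SCC are serialized.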
The Incomplete Input Problem
[Figure: lock graph with nodes A, B, and C]
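Thread 1        Thread 2
lock AB
lock A          lock C
                lock AB
lock C          lock B
release C       release B
release A       release C
(lock AB is the wrapper lock added for the SCC containing A and B)
Without the wrapper lock: No Deadlock. With the wrapper lock: Deadlock!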
Solution to the Incomplete Input Problem
▪ Maintain a dynamic lock graph of lock requests and acquisitions
▪ Once an SCC containing a wrapper lock is detected in the dynamic graph
– Cancel the wrapper lock and abort healing (but do not abort the run!)
• Wrapper locks can be implemented by semaphores
– Update the static lock graph with the new SCC
Implementation and Results
▪ Implemented as a Listener on top of ConTest, including handling incomplete input and calls to wait
▪ Ran on Telefonica’s TIDOrbJ, Java 1.4 Collection Library, and NASA’s Ames K9 Rover Executive with 100% healing success
▪ Performance overhead: 15% for Rover, 30% for Java 1.4 Collection Library
– Possible improvements: use online cycle detection algorithms, decompose SCCs
▪ Adopted by Microsoft in their concurrency unit testing tool CHESS
Interesting Open Problems
▪ Work on combining static and dynamic analysis
– When a static tool finds a bug (beware of false positives), instead of telling the user, try to make it happen
– Maybe the static tool needs a hint from the dynamic tool before you tell the user (e.g., whether x can be greater than zero)
– The problem space is too big: find an area that might have a problem with a dynamic tool and verify it with a model checker (testing-based reductions)
Additional Interesting Problems
▪ Find redundant safety features
– How to remove synchronization safely
▪ Better evaluation of the quality of the testing done
▪ Support for collection of information that will make debugging more efficient
Useful Infrastructure
[Figure: infrastructure components, labeled Static and Dynamic: Static Analysis, Formal Verification, State Space Exploration, Cloning, Instrumentation Engine, Noise Making, Replay, Debugging, On-line Race Detection, Off-line Race Detection, Coverage, Trace Evaluation, Performance Monitoring, Observation Database]
Atomicity is Never Ensured
static void transfer(Transfer t) {
    balances[t.fundFrom] -= t.amount;
    balances[t.fundTo] += t.amount;
}
▪ Expected behavior
– Money should pass from one account to another
▪ Observed behavior
– Sometimes the amount taken is not equal to the amount received
▪ Possible bug
– Thread switch in the middle of money transfers
Atomicity is Never Ensured II
▪ Assume atomic transfer (can't be implemented locally in Java)
balances[t.fundFrom] -= t.amount;
balances[t.fundTo] += t.amount;
▪ Assume a counting loop that checks if the total remains the same
▪ Does it work now?
[Figure: account balances seen by the counting loop: … $5000 $2500 … … … $4500]
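One way to read the question: even if each transfer is atomic, a counting loop that reads the balances one by one can still see money "in flight" between two of its reads. The sketch below shows one possible remedy under a simple assumption, a single lock shared by transfers and by the counting loop. BankSketch and its methods are hypothetical names, not code from the talk.

```java
// Hypothetical bank model: transfers and the total both run under the same lock.
final class BankSketch {
    private final long[] balances;
    private final Object lock = new Object();

    BankSketch(int accounts, long initialBalance) {
        balances = new long[accounts];
        java.util.Arrays.fill(balances, initialBalance);
    }

    void transfer(int from, int to, long amount) {
        synchronized (lock) {       // both updates happen as one step for other lock users
            balances[from] -= amount;
            balances[to] += amount;
        }
    }

    long total() {
        synchronized (lock) {       // the whole counting loop excludes transfers
            long sum = 0;
            for (long b : balances) sum += b;
            return sum;
        }
    }

    /** Runs transfers while auditing the total; returns {finalTotal, mismatchesSeen}. */
    static long[] demo() {
        BankSketch bank = new BankSketch(4, 1000);
        long expected = 4000;
        Thread mover = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) bank.transfer(i % 4, (i + 1) % 4, 10);
        });
        long mismatches = 0;
        mover.start();
        while (mover.isAlive()) {
            if (bank.total() != expected) mismatches++;
        }
        try {
            mover.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { bank.total(), mismatches };
    }

    public static void main(String[] args) {
        long[] result = demo();
        System.out.println("total=" + result[0] + " mismatches=" + result[1]);
    }
}
```

If total() did not take the lock, the audit could observe a total that is off by a transfer amount even though every individual transfer is atomic.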
Wrong/No-Lock
Wrong lock or no lock
▪ Protection of Thread 1 does not apply to Thread 2
▪ An access protocol is not followed due to
– A new team member
– An attempt to improve performance
Thread 1:
synchronized (o) {
    x++;
}
Thread 2:
x++;
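That Thread 1's lock gives no protection against Thread 2 can be shown directly: while one thread holds the lock on o, an unsynchronized x++ in another thread still runs to completion, whereas a properly guarded x++ has to wait. WrongLockSketch and its timeout values are assumptions made for this illustration.

```java
// Hypothetical demo: holding the lock on o does not stop code that never asks for it.
final class WrongLockSketch {
    private final Object o = new Object();
    private int x;

    void guardedIncrement() {
        synchronized (o) {
            x++;
        }
    }

    void unguardedIncrement() {
        x++; // no lock: Thread 1's protection does not apply here
    }

    /** Returns true if `thread2Body` finished while the calling thread was holding the lock on o. */
    boolean finishesWhileLockHeld(Runnable thread2Body, long waitMillis) {
        Thread thread2 = new Thread(thread2Body);
        boolean finished;
        synchronized (o) {
            x++;
            thread2.start();
            finished = joinQuietly(thread2, waitMillis);
        }
        joinQuietly(thread2, 0); // 0 means wait until the thread ends, now that the lock is free
        return finished;
    }

    private static boolean joinQuietly(Thread t, long millis) {
        try {
            t.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }

    public static void main(String[] args) {
        WrongLockSketch s = new WrongLockSketch();
        System.out.println("unguarded ran inside the critical section: "
                + s.finishesWhileLockHeld(s::unguardedIncrement, 500));
        System.out.println("guarded ran inside the critical section: "
                + s.finishesWhileLockHeld(s::guardedIncrement, 200));
    }
}
```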