Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
IBM Haifa Research Lab Concurrent Bugs How to Eliminate Them or Render Them Harmless Shmuel Ur © 2008 IBM Corporation IBM Haifa Research Lab Outline Why concurrent testing is so hard – Some bug samples Do programmers know how to write concurrent programs? – What you don’t know will hurt you! Testing – finding bugs or making them appear Debugging – one of the major problems of the field Healing – can you make the problem disappear or be a lot less likely? 2 © 2008 IBM Corporation IBM Haifa Research Lab Why is Concurrent Testing Hard? Concurrency introduces non-determinism – Multiple executions of the same test may have different interleavings (and different results!) • An interleaving is the relative execution order of the program threads – Very hard to reproduce and debug No useful coverage measures for the interleaving space Typically appear only in specific configurations – Therefore commonly found by users – Require large configurations to test Represent only ~10% of the bugs but an unproportionate number are found late or by the customer -> Very expensive The costly effort of testing concurrency at the system level is seemingly unavoidable 3 © 2008 IBM Corporation IBM Haifa Research Lab Classic Java Counter Example Consider the function increase(), which is a part of a class that acts as a counter public void increase() { counter++; } Although written as a single “increase” operation, the “++” operator is actually mapped into three JVM instructions [load operand, increment, write-back] 4 © 2008 IBM Corporation IBM Haifa Research Lab Counter Example, continued Thread A … Load op Thread B Context Switch increment write-back … … load op increment write-back … counter = 4 3 5 © 2008 IBM Corporation IBM Haifa Research Lab Simple Protocol B A The train enters the tunnel. The semaphore automatically turns the light to red. Signalman A sends a message to signalman B that a train is in the tunnel. The train exits the tunnel. Signalman B sends a message to signalman A, that the train has exited the tunnel. Signalman A sets the light to green. 6 © 2008 IBM Corporation IBM Haifa Research Lab Train Stopping on Red B A The first train enters the tunnel. The semaphore automatically turns the light to red. Signalman A sends a message to signalman B, that a train is in the tunnel. The second train approaches the tunnel and stops at the red light. The first train exits the tunnel. B sends a message to A, that the first train has exited the tunnel. A sets the light to green. The second train enters the tunnel. 7 © 2008 IBM Corporation IBM Haifa Research Lab Malfunctioning Semaphore B A The first train enters the tunnel – the semaphore fails to turn the light red! Signalman A sends a message to signalman B, that a train is in the tunnel. The second train approaches the tunnel. Signalman A runs to the track and signals to the train to stop and sets the light to red, manually. The first train exits the tunnel. B sends a message to A, that the first train has exited the tunnel. A sets the signal to green and lets the second train pass. The second train enters and leaves the tunnel. 8 © 2008 IBM Corporation IBM Haifa Research Lab Clayton Tunnel Accident - 25 August 1861 B A The first train theexits tunnel the semaphore fails to turn the light red! Theenters first train the– tunnel. SignalmanBAsends sendsaamessage messagetotoA,signalman B, that train isthe in tunnel. the tunnel. that the train hasaexited SignalmanAAinterprets runs to the to set as themeaning light to red signal stop. thetrack message thatand both trainsother havetrains exitedtothe tunnel, andtrain he sets the light to green. But a second arrives and enters the tunnel before he has time to change the light to red. The driver sees something but doesn’t have time to stop. The second train driver decides to back out of the tunnel. Signalman A sends another message to signalman B, that a train is in the tunnel. The third train enters the tunnel. The third train approaches the tunnel and stops. The second and third trains collide. The driver of the second train is suspicious that something is wrong and stops in the middle of the tunnel. 9 © 2008 IBM Corporation IBM Haifa Research Lab Common Situations May Have Concurrent Aspects 10 © 2007 SHADOWS EU Framework © 2008 Program IBM Corporation IBM Haifa Research Lab Common Situations May Have Concurrent Aspects 11 © 2007 SHADOWS EU Framework © 2008 Program IBM Corporation IBM Haifa Research Lab Some Problems Are Ubiquitous 12 © 2007 SHADOWS EU Framework © 2008 Program IBM Corporation IBM Haifa Research Lab When the Problem is Common, the Solution is Part of the Design 13 © 2007 SHADOWS EU Framework © 2008 Program IBM Corporation IBM Haifa Research Lab Bug Found by ConTest in WebSphere Site Analyzer Crawler, a ~1000 line Java component, is part of the WebSphere Site Analyzer used to perform content analysis of web sites Bug description if (connection != null) connection.setStopFlag(); Connection is checked to be !null CPU is lost Connection is set to null before CPU is regained If this happens before connection.setStopFlag(); executes, an exception is taken This bug was found while still testing ConTest This bug should (also) have been found in unit testing... 14 © 2008 IBM Corporation IBM Haifa Research Lab Initialization-Sleep One example is adding sleep() statements to ensure that only the correct interleavings occur 15 Partial, non-consistent results are used by the thread that assumes that initialization is done © 2008 IBM Corporation IBM Haifa Research Lab Lost-Notify Losing notify: the notify is “lost” because it occurs before the thread executes the wait() primitive The gap was created because the programmer didn’t think the notify would occur before the wait Thread 1 Thread 2 synchronized (o){ o.notifyAll(); } Synchronized (o){ o.wait(); } 16 © 2008 IBM Corporation IBM Haifa Research Lab Orphaned-Thread The tale of the orphaned thread A single master thread drives actions of other threads Messages are put on the queue by the master thread and processed by the worker’s threads Abnormal termination of the master thread results in the remaining threads being orphaned 17 The system often blocks © 2008 IBM Corporation IBM Haifa Research Lab Unintentional-Different-Thread A call to an API (typically a GUI API) is assumed to be in the same thread …. but is actually in a different thread …….. causing the order of locking to sometimes change ………… resulting in a deadlock 18 Nasty when you have a deadlock in a single thread program © 2008 IBM Corporation IBM Haifa Research Lab Condition-For-Wait thread can also wake up without being AMissing condition enclosing thenotified, wait interrupted, or timing out, a so-called spurious wakeup. While this will rarely occur When returning frommust a wait theagainst programmer forgets to in– practice, applications guard it by testing for the check,that or checks incorrectly thethread reason forawakened, which he and condition should have causedifthe to be waited still holds continuing to wait if the condition is not satisfied. In other words, waits should1.5 always occur loops, to liketerminate. this one: Is your code – In Java a wait canindecide ready? synchronized (obj) { while (<condition does not hold>) obj.wait(timeout); ... // Perform action appropriate From standard to the condition http://java.sun.com/j2se/1.5.0/docs/api/java/lang/O } bject.html#wait(long) 19 © 2008 IBM Corporation IBM Haifa Research Lab Outline Why concurrent testing is so hard – Some bug samples Do programmers know how to write concurrent programs? – What you don’t know will hurt you! Testing – finding bugs or making them appear Debugging – one of the major problems of the field Healing – can you make the problem disappear or be a lot less likely? 20 © 2008 IBM Corporation IBM Haifa Research Lab Riddle I: How Many Bugs Do You See? public class Helloworld{ class Hello extends Thread{ public static void main(String[] argv){ public void run(){ Hello helloThread = new Hello(); World worldThread = new World(); helloThread.start(); worldThread.start(); System.out.print("hello "); } } class World extends Thread{ public void run(){ try{Thread.sleep(1000);} catch(Exception exc){}; System.out.print("world" ); System.out.print("\n"); } } 21 } } © 2008 IBM Corporation IBM Haifa Research Lab Riddle II: Understanding Synchronization Primitives Locks protect the code segment and not the shared data – Obtaining a lock when accessing the shared resources On an error path (e.g., an exception), does the system release the lock? Consider the following class: class Conflict { Conflict(…){ synchronized(Conflict.class){…}; }; void h(…){ synchronized(this){….};}; synchronized void g(…){….}; void r(…){…}; }; synchronized static void f(…){….}; Can f || g, f || h, f || r, g || h, g || r, h || r cause a conflict? 22 © 2008 IBM Corporation IBM Haifa Research Lab Bug Published by the ConTest Project Team 23 © 2008 IBM Corporation IBM Haifa Research Lab Some Interesting Observations In practice, thread switches are few and far between – The probability of finding the previous bug is low – Synchronization usually operates as a no-op • Removing all synchronizations does not usually impact testing results! • Not knowing the synchronization primitives’ exact definition does not impact testing but the program is incorrect – Exception in synchronization: Do you still have the lock? – What is the synchronization on? Thread scheduling for small applications is almost deterministic in simple environment – Each environment has its own interleaving • Customer on the first day finds bugs in well tested applications! 24 © 2008 IBM Corporation IBM Haifa Research Lab Outline Why concurrent testing is so hard – Some bug samples Do programmers know how to write concurrent programs? – What you don’t know will hurt you! Testing – finding bugs or making them appear Debugging – one of the major problems of the field Healing – can you make the problem disappear or be a lot less likely? 25 © 2008 IBM Corporation IBM Haifa Research Lab How Does a Noise Maker Work? Find places that are “concurrently interesting” – Places whose relative interleaving may change the result of the program: access to shared variables, synch primitives – Possibly use bug patterns to identify such places • Lost notify • Sleep() • Condition before synchronization Modify the program by adding interleaving changing mechanisms – Usually sleep(), yield() but more advanced mechanisms exist • Make sure no new bug is introduced – Make sure that the interleaving change is random Download a noise maker: http://www.alphaworks.ibm.com/tech/contest 26 © 2008 IBM Corporation IBM Haifa Research Lab With ConTest, Each Test Goes a Long Way 27 © 2008 IBM Corporation IBM Haifa Research Lab Measuring Concurrent Coverage ConTest also measures coverage – The feature that people wanted most is not to find bugs How do you know if the tests are any good from the view of concurrency? Synchronization coverage – Make sure that every synchronization primitive did something See Applications of synchronization coverage 28 © 2008 IBM Corporation IBM Haifa Research Lab Unit Testing Concurrent Code with ConTest 1. Remove non-relevant code (optional) 2. Create a number of tests 3. Apply ConTest to the code 4. Run each test multiple time and measure synchronization coverage 1. If a bug is found, use ConTest to debug, fix, go to 3 2. If deadlock violation found, fix, go to 3 5. If coverage insufficient, analyze 1. Need to rerun tests, go to 4 2. Need to write new tests, go to 2 6. Done 29 © 2008 IBM Corporation IBM Haifa Research Lab Function and System Test with ConTest Increased efficiency in finding intermittent bugs Negligible change to process in Java (in C recompilation is required) – Installed in WebSphere size projects in a day – Works under any test automation used (e.g., Rational tools) Additional benefits 30 – Coverage measurements – Aids in debugging © 2008 IBM Corporation IBM Haifa Research Lab Delaying Messages with ConTest/UDP Application Application Mode = Delay Direction = Out 3 1 Tool ConTest/UDP Tool ConTest/UDP Java Java Java Java UDP UDP UDP UDP IP Link Link B 31 2 Application Network IP Link Link A © 2008 IBM Corporation IBM Haifa Research Lab Delaying Messages with ConTest/UDP 1 2 3 4 Application Mode = No-Noise Application Application 4 3 1 Tool ConTest/UDP ConTest/UDP Java Java Java Java UDP UDP UDP UDP IP Link Link B 32 2 Network IP Link Link A © 2008 IBM Corporation IBM Haifa Research Lab Losing Messages with ConTest/UDP Application Application Mode = Block Direction = Out XX X 3 Tool ConTest/UDP 2 1 ConTest/UDP Java Java Java Java UDP UDP UDP UDP IP Link Link B 33 Application Network IP Link Link A © 2008 IBM Corporation IBM Haifa Research Lab Losing Messages with ConTest/UDP 4 Application Mode = No-Noise Application Application XX X 3 Tool ConTest/UDP 1 ConTest/UDP Java Java Java Java UDP UDP UDP UDP IP Link Link B 34 2 4 Network IP Link Link A © 2008 IBM Corporation IBM Haifa Research Lab Many Other Testing Tools/Technologies 35 © 2008 IBM Corporation IBM Haifa Research Lab Outline Why concurrent testing is so hard – Some bug samples Do programmers know how to write concurrent programs? – What you don’t know will hurt you! Testing – finding bugs or making them appear Debugging – one of the major problems of the field Healing – can you make the problem disappear or be a lot less likely? 36 © 2008 IBM Corporation IBM Haifa Research Lab What is Special about Concurrent Debugging? Debugging is difficult – very few good concurrent debuggers Debugging sometimes makes the bug disappear… – Same for print statements Don’t even know the state of the program Test fail at the customer/testers – not at the developer – Some just happen very rarely Need to look at a lot of things – Cause problems for reviews as well 37 © 2008 IBM Corporation IBM Haifa Research Lab Debugging Toolbox Lock discipline violation Orange_box – remember {last_values_size} of values and locations for each variable Replay Talk with (socket/keyboard) ConTest while the application is running Race detection Automatic debugging 38 © 2008 IBM Corporation IBM Haifa Research Lab Automatic Debugging as Feature Selection Think of each instrumentation point as a feature Execute the program many times with different subset – Each instrumentation point gets a score – P(i) = P(success|Xi)/P(!success|Xi) There are false negatives and positives The nodes with the highest score may be related to location of failure 39 © 2008 IBM Corporation IBM Haifa Research Lab First Largish Program 40 © 2008 IBM Corporation IBM Haifa Research Lab Does Not Work When the Problem is Easy 41 © 2008 IBM Corporation IBM Haifa Research Lab Differences on the First Large Program 42 © 2008 IBM Corporation IBM Haifa Research Lab Differences on the Second Large Program 43 © 2008 IBM Corporation IBM Haifa Research Lab Outline Why concurrent testing is so hard – Some bug samples Do programmers know how to write concurrent programs? – What you don’t know will hurt you! Testing – finding bugs or making them appear Debugging – one of the major problems of the field Healing – can you make the problem disappear or be a lot less likely? 44 © 2008 IBM Corporation IBM Haifa Research Lab Healing Concurrent Programs In testing, we try to make them more likely to fail In the field, we may want to make them less likely to fail General problem statement: the program sometimes fails (depending on the interleaving) – General solution: eliminate, or make the interleaving the lead to bugs less likely; but do not cause deadlocks We do it for races, deadlocks … – Not quite as useful but very interesting – Done as a project called SHADOWS in the EU (with Brno) 45 © 2008 IBM Corporation IBM Haifa Research Lab Classic Deadlock Situation Thread 1 lock A 46 Thread 2 Deadlock lock B lock B lock A release B release A release A release B © 2008 IBM Corporation IBM Haifa Research Lab Lock Discipline Violation Detection Thread 1 Thread 2 lock A lock B Build a directed graph of lock acquisitions A B lock B lock A release B release A release A release B Unguarded cycles are potential deadlocks – No common gate lock is held while the locks in the cycle are held Even when not real deadlocks, prone to become ones as the code evolves Cross run lock discipline violation detection [Ur et. al 05]: – Identify locks across different runs according to program location 47 © 2008 IBM Corporation IBM Haifa Research Lab Our Deadlock Healing For each unguarded SCC, add a wrapper (gate) lock – Taken before any lock in the SCC is taken – Released after all locks in the SCC are released Very easy to implement – Does not require run-time monitoring – Could be implemented by code change – However, some complications exist… 48 © 2008 IBM Corporation IBM Haifa Research Lab The Incomplete Input Problem A B C 49 © 2008 IBM Corporation IBM Haifa Research Lab The Incomplete Input Problem Thread 1 No Deadlock Deadlock ! Thread 2 lock AB lock A lock C lock AB 50 lock C lock B release C release B release A release C © 2008 IBM Corporation IBM Haifa Research Lab Solution to the Incomplete Input Problem Maintain a dynamic lock graph of lock requests and acquisitions Once an SCC containing a wrapper lock is detected in the dynamic graph – Cancel the wrapper lock and abort healing (but do not abort run!) • Wrapper locks can be implemented by semaphores – Update the static lock graph with the new SCC 51 © 2008 IBM Corporation IBM Haifa Research Lab Implementation and Results Implemented as a Listener on top of ConTest, including handling incomplete input and calls to wait Ran on Telefonica’s TIDOrbJ, Java 1.4 Collection Library, and NASA’s Ames K9 Rover Executive with 100% healing success Performance overhead: 15% for Rover, 30% for Java 1.4 Collection Library – Possible improvements: use online cycle detection algorithms, decompose SCCs Adopted by Microsoft in their concurrency unit testing tool CHESS 52 © 2008 IBM Corporation IBM Haifa Research Lab Interesting Open Problems Work on combining static and dynamic – When a static tool finds a bug (beware of false positive), instead of telling the user - try to make it happen – Maybe the static tool needs a hint from dynamic before you tell the user (if x can be greater than zero) – The problem space is too big, find an area that might have a problem with dynamic and verify it with a model checker – testing based reductions 53 © 2008 IBM Corporation IBM Haifa Research Lab Additional Interesting Problems Find redundant safety features – How to remove synchronization safely Better evaluation of the quality of the testing done Support for collection of information that will make debugging more efficient 54 © 2008 IBM Corporation IBM Haifa Research Lab Useful Infrastructure Static Analysis Debugging Replay On-line Race Detection Static Dynamic Noise Making Formal Verification State Space Exploration Observation Database Off-line Race Detection Coverage Trace Evaluation Cloning Instrumentation Engine 55 Performance Monitoring © 2008 IBM Corporation IBM Haifa Research Lab 56 © 2008 IBM Corporation IBM Haifa Research Lab Atomicity is Never Ensured static void transfer(Transfer t) { balances[t.fundFrom] -= t.amount; balances[t.fundTo] += t.amount; } Expected behavior Money should pass from one account to another Observed behavior Sometimes the amount taken is not equal to the amount received Possible bug Thread switch in the middle of money transfers 57 © 2008 IBM Corporation IBM Haifa Research Lab Atomicity is Never Ensured II Assume Atomic transfer (can't be implemented locally in Java) balances[t.fundFrom] -= t.amount; balances[t.fundTo] += t.amount; Assume a counting loop that checks if the total remains the same Does it work now? … $5000 $2500 … … … $4500 58 © 2008 IBM Corporation IBM Haifa Research Lab Wrong/No-Lock Wrong lock or no lock Protection of Thread 1 does not apply to Thread 2 An access protocol is not followed due to A new team member An attempt to improve performance Thread 1 Thread 2 Synchronized (o){ x++; x++; } 59 © 2008 IBM Corporation