[HSC-1108] Exceptions in Task initialization don't result in MPI abort
Created: 05/Dec/14  Updated: 08/Dec/14  Resolved: 08/Dec/14

Status: Done
Project: HSC Data Management
Component/s: hscPipe
Affects Version/s: None
Fix Version/s: None

Type: Story
Reporter: Claire Lackner
Resolution: Done
Labels: MPI, batch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Priority: Major
Assignee: Paul Price
Votes: 0
Reviewers: Claire Lackner

Description

Claire writes:

I'm running our fake galaxy injection sometimes with reduceFrames.py and sometimes with hscProcessCcd.py, and I came across some weird behavior. Basically, I mis-specified a configuration parameter (a filename for the catalog of fake sources to add). When I run it in hscProcessCcd, the code throws an IOError and stops. When I run it in reduceFrames, the code still throws the error, but the job remains stuck in the queue. Here's a mini code snippet:

    class myFakes(FakeSourcesTask):
        ConfigClass = myFakesConfig

        def __init__(self, **kwargs):
            FakeSourcesTask.__init__(self, **kwargs)
            with open(self.config.galFile) as fp:
                self.galData = fits.open(self.config.galFile)[1].data

So, if the filename config.galFile is wrong (oops), this throws an IOError when myFakes is initialized the first time (it happens before the CCD processing starts, I don't know why). When running reduceFrames, you can see this error in the output for the processing of the top-level job (JOBNAME.JOBNUMBER file), but the job doesn't quit then; it just hangs in the queue until timing out.

Is there a way I should be handling the errors so they work in torque? The job should just die at this point if it doesn't have the file.

Comments

Comment by Paul Price [ 05/Dec/14 ]
The problem is that the parseAndRun method isn't protected by an abortOnError decorator.

Comment by Paul Price [ 05/Dec/14 ]
Claire, could you try out this fix, please? It's on branch u/price/ of hscPipeBase.

    price@price-laptop:~/hsc/hscPipeBase (u/price/HSC-1108 $=) $ git --no-pager log --stat --reverse origin/master..
    commit 012c416c84faa379888d6715c33640542ba80c20
    Author: Paul Price <[email protected]>
    Date:   Fri Dec 5 08:28:06 2014 -0500

        BatchPoolTask: protect parseAndRun with @abortOnError

        An (uncaught) exception in the Task instantiation would kill the
        process without bringing down the MPI framework. Protecting
        parseAndRun with an @abortOnError means that exception will now
        be caught and a proper MPI abort issued.

     python/hsc/pipe/base/parallel.py | 3 ++-
     1 file changed, 2 insertions(+), 1 deletion(-)

Comment by Claire Lackner [ 08/Dec/14 ]
It looks like this fix doesn't work. Now reduceFrames blows up right away:

    Traceback (most recent call last):
      File "/home/bot/sandbox/bbot/hscPipe/bin/reduceFrames.py", line 2, in <module>
        from hsc.pipe.tasks.processExposure import ProcessExposureTask
      File "/home/bot/sandbox/bbot/hscPipe/python/hsc/pipe/tasks/processExposure.py", line 15, in <module>
        from hsc.pipe.tasks.focusTask import ProcessFocusTask
      File "/home/bot/sandbox/bbot/hscPipe/python/hsc/pipe/tasks/focusTask.py", line 16, in <module>
        from hsc.pipe.base.parallel import BatchPoolTask
      File "/home/bick/hscPipeBase/python/hsc/pipe/base/parallel.py", line 383, in <module>
        class BatchPoolTask(BatchCmdLineTask):
      File "/home/bick/hscPipeBase/python/hsc/pipe/base/parallel.py", line 385, in BatchPoolTask
        @classmethod
      File "/home/bick/hscPipeBase/python/hsc/pipe/base/pool.py", line 80, in abortOnError
        @wraps(func)
      File "/data1a/ana/products2014/Linux64/python/2.7.6/lib/python2.7/functools.py", line 33, in update_wrapper
        setattr(wrapper, attr, getattr(wrapped, attr))
    AttributeError: 'classmethod' object has no attribute '__module__'

The problem is explained here. It seems like functools.wraps expects a function to have a '__name__' and a '__module__', which classmethods don't have.
The easy fix is to switch the order of the decorators for the parseAndRun method to:

    @classmethod
    @abortOnError

I've tested that, and it dies on the error as expected, and runs fine when there is no error in the config.

Comment by Paul Price [ 08/Dec/14 ]
Thanks for fixing that! Merged this to master. It should be in the next release.

Generated at Fri May 12 10:44:06 EDT 2017 using JIRA 6.2.2#6258sha1:65ffb4362589622c100f6488635539584b0f7b98.
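
Illustrative note on the decorator order discussed in the comments above. This is a minimal sketch, not the hscPipeBase code: abort_on_error, DemoTask, parse_and_run and the aborted list are hypothetical names, and the abort callable stands in for whatever actually brings down the MPI job. It only shows why @classmethod is placed outermost, so that the error-catching wrapper is applied to a plain function that functools.wraps can inspect. Newer Python versions are more tolerant of missing attributes in functools.wraps, so the wrong order may fail differently there than in the Python 2.7 traceback above.

```python
import functools


def abort_on_error(abort):
    """Build a decorator that calls `abort(exc)` if the wrapped function raises.

    `abort` is a hypothetical stand-in for the real MPI abort call.
    """
    def decorator(func):
        @functools.wraps(func)  # func must be a plain function here
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                abort(exc)
        return wrapper
    return decorator


aborted = []  # records exceptions instead of really aborting anything


class DemoTask(object):
    @classmethod                     # outermost: applied last, to the wrapper
    @abort_on_error(aborted.append)  # innermost: wraps the plain function
    def parse_and_run(cls, gal_file):
        # Mimics the mis-specified catalog filename: open() raises IOError.
        with open(gal_file) as fp:
            return fp.read()


# The missing file triggers the abort hook instead of an uncaught exception.
DemoTask.parse_and_run("no_such_dir/no_such_catalog.fits")
```

With this order the wrapper sees the function before it becomes a classmethod, so the exception raised by the missing file reaches the abort hook rather than escaping uncaught.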