[HSC-1108] Exceptions in Task initialization don't result in MPI abort
Created: 05/Dec/14 Updated: 08/Dec/14 Resolved: 08/Dec/14
Status: Done
Project: HSC Data Management
Component/s: hscPipe
Affects Version/s: None
Fix Version/s: None
Type: Story
Priority: Major
Reporter: Claire Lackner
Assignee: Paul Price
Resolution: Done
Votes: 0
Labels: MPI, batch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Reviewers: Claire Lackner
Description
Claire writes:
I'm running our fake galaxy injection sometimes with reduceFrames.py and
sometimes with hscProcessCcd.py, and I came across some weird behavior.
Basically, I mis-specified a configuration parameter (a filename for the catalog
of fake sources to add). When I run it in hscProcessCcd, the code throws an
IOError and stops. When I run it in reduceFrames, the code still throws the error,
but the job remains stuck in the queue. Here's a mini code snippet:
class myFakes(FakeSourcesTask):
    ConfigClass = myFakesConfig

    def __init__(self, **kwargs):
        FakeSourcesTask.__init__(self, **kwargs)
        # Raises IOError here if config.galFile does not exist
        with open(self.config.galFile) as fp:
            self.galData = fits.open(self.config.galFile)[1].data
So, if the filename config.galFile is wrong (oops), this throws an IOError when
myFakes is initialized the first time (it happens before the CCD processing starts,
I don't know why). When running reduceFrames, you can see this error in the
output for the processing of the top-level job (JOBNAME.JOBNUMBER file),
but the job doesn't quit then, it just hangs in the queue until timing out. Is there a
way I should be handling the errors so they work in torque? The job should just
die at this point if it doesn't have the file.
Comments
Comment by Paul Price [ 05/Dec/14 ]
The problem is that the parseAndRun method isn't protected by an abortOnError decorator.
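A decorator of this kind can be sketched as follows. This is only an illustration, not the hscPipeBase code: the actual abortOnError implementation is not shown in this issue. The names make_abort_on_error, abort, abort_calls and parse_and_run are hypothetical, and the abort callback stands in for the real MPI abort call (with mpi4py that would typically be MPI.COMM_WORLD.Abort(1)) so the sketch runs without MPI.

```python
import functools
import sys
import traceback


def make_abort_on_error(abort):
    """Build a decorator that calls `abort` when the wrapped function raises.

    `abort` is a hypothetical stand-in for the MPI abort call, so this
    sketch runs without MPI installed.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Report the failure, then bring the whole job down rather
                # than leaving the other processes waiting in the queue.
                traceback.print_exc(file=sys.stderr)
                abort(1)
        return wrapper
    return decorator


abort_calls = []  # records abort codes in place of a real MPI abort


@make_abort_on_error(abort_calls.append)
def parse_and_run(filename):
    # Hypothetical task start-up that fails on a bad file name.
    with open(filename) as fp:
        return fp.read()


parse_and_run("/nonexistent/galaxy_catalog.fits")
```

Without the wrapper, the exception would end only the process that raised it; with it, the failure is turned into an explicit abort request for the whole job.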
Comment by Paul Price [ 05/Dec/14 ]
Claire, could you try out this fix, please? It's on branch u/price/ of hscPipeBase.
price@price-laptop:~/hsc/hscPipeBase (u/price/HSC-1108 $=) $ git --no-pager log --stat --reverse origin/master..
commit 012c416c84faa379888d6715c33640542ba80c20
Author: Paul Price <[email protected]>
Date:   Fri Dec 5 08:28:06 2014 -0500

    BatchPoolTask: protect parseAndRun with @abortOnError

    An (uncaught) exception in the Task instantiation would kill the process
    without bringing down the MPI framework. Protecting parseAndRun with
    an @abortOnError means that exception will now be caught and a proper
    MPI abort issued.

 python/hsc/pipe/base/parallel.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Comment by Claire Lackner [ 08/Dec/14 ]
It looks like this fix doesn't work. Now reduceFrames blows up right away:
Traceback (most recent call last):
  File "/home/bot/sandbox/bbot/hscPipe/bin/reduceFrames.py", line 2, in <module>
    from hsc.pipe.tasks.processExposure import ProcessExposureTask
  File "/home/bot/sandbox/bbot/hscPipe/python/hsc/pipe/tasks/processExposure.py", line 15, in <module>
    from hsc.pipe.tasks.focusTask import ProcessFocusTask
  File "/home/bot/sandbox/bbot/hscPipe/python/hsc/pipe/tasks/focusTask.py", line 16, in <module>
    from hsc.pipe.base.parallel import BatchPoolTask
  File "/home/bick/hscPipeBase/python/hsc/pipe/base/parallel.py", line 383, in <module>
    class BatchPoolTask(BatchCmdLineTask):
  File "/home/bick/hscPipeBase/python/hsc/pipe/base/parallel.py", line 385, in BatchPoolTask
    @classmethod
  File "/home/bick/hscPipeBase/python/hsc/pipe/base/pool.py", line 80, in abortOnError
    @wraps(func)
  File "/data1a/ana/products2014/Linux64/python/2.7.6/lib/python2.7/functools.py", line 33, in update_wrapper
    setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: 'classmethod' object has no attribute '__module__'
The problem is explained here. It seems that functools.wraps expects the wrapped function to have a
'__name__' and a '__module__', which classmethod objects don't have. The easy fix is to switch the
order of the decorators on the parseAndRun method to:
@classmethod
@abortOnError
I've tested that, and it dies on the error as expected, and runs fine when there is no error in the
config.
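The working decorator order can be illustrated with a small self-contained sketch. The names abort_on_error, BatchTask and parse_and_run are hypothetical stand-ins, not the hscPipeBase code, and the stand-in decorator records the exception instead of issuing an MPI abort so that it runs anywhere.

```python
import functools


def abort_on_error(func):
    """Hypothetical stand-in for abortOnError: records failures instead of aborting."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            wrapper.errors.append(exc)
    wrapper.errors = []
    return wrapper


class BatchTask(object):
    # classmethod is outermost, so abort_on_error wraps the plain function
    # and functools.wraps can copy the function's metadata.
    @classmethod
    @abort_on_error
    def parse_and_run(cls, fail=False):
        if fail:
            raise IOError("catalog file not found")
        return cls.__name__


ok = BatchTask.parse_and_run()
failed = BatchTask.parse_and_run(fail=True)
errors = BatchTask.parse_and_run.__func__.errors
```

With the decorators in the opposite order, the error-handling decorator receives a classmethod object rather than a function, which on the Python 2.7 used here is what triggers the AttributeError in the traceback.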
Comment by Paul Price [ 08/Dec/14 ]
Thanks for fixing that!
Merged this to master. It should be in the next release.
Generated at Fri May 12 10:44:06 EDT 2017 using JIRA 6.2.2#6258sha1:65ffb4362589622c100f6488635539584b0f7b98.