Working with DRMAA
2 March 2015
Anyone who has read the earlier posts may have come across the previous post on Galaxy, which used DRMAA to submit jobs with PBS Pro as the job scheduler. However, a recent upgrade of our systems uncovered some unexpected issues with DRMAA.
Going native with DRMAA
The first issue is with a DRMAA feature called nativeSpecification. This allows scheduler-specific features to be used through the vendor-neutral DRMAA library. It seems not all features were supported in the specific DRMAA library we were using (pbs-drmaa from http://apps.man.poznan.pl/trac/pbs-drmaa). Specifically, we had to add support for the project option (-P) so that our project usage statistics would pick up the usage. We will have to remember to feed the changes back.
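As a rough illustration of how nativeSpecification is used from Python, the sketch below passes a project option through a job template. The helper names build_native_spec and submit_with_project and the project name are hypothetical. It assumes the drmaa Python package and a DRMAA library that accepts -P in the native specification, which the unpatched pbs-drmaa described above did not.

```python
def build_native_spec(project):
    """Return a PBS-style native specification string carrying the project option.

    `project` is a hypothetical project/account name used only for illustration.
    """
    return "-P {}".format(project)


def submit_with_project(command, project):
    """Submit `command` via DRMAA, passing -P through nativeSpecification."""
    # Imported here because the drmaa package needs a DRMAA library
    # (for example pbs-drmaa) to be installed before it can be imported.
    import drmaa

    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = command
        # Scheduler-specific options travel as a plain string.
        jt.nativeSpecification = build_native_spec(project)
        job_id = session.runJob(jt)
        session.deleteJobTemplate(jt)
        return job_id
```

Whether the scheduler actually honours the option depends on the DRMAA library parsing it, which is exactly where we ran into trouble.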
Python DRMAA package
Galaxy uses the Python package from the GitHub project python-drmaa, which is a wrapper around a DRMAA library such as pbs-drmaa and works very well. It turned out there was a slight issue when splitting a key/value string on “=” where the value can itself contain an “=” symbol. As with everything in Python, the fix was simple: set the maximum number of splits to 1, such as:
k,v = attr.split("=", 1)
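To see why the limit matters, here is a small self-contained sketch. The function name parse_attribute and the sample strings are made up for illustration and are not taken from the package itself.

```python
def parse_attribute(attr):
    """Split a 'key=value' string on the first '=' only."""
    key, value = attr.split("=", 1)
    return key, value


# A value that itself contains '=' stays intact.
print(parse_attribute("Variable_List=PATH=/usr/bin"))

# Without the limit the string splits into three parts,
# so unpacking into two names raises ValueError.
try:
    k, v = "Variable_List=PATH=/usr/bin".split("=")
except ValueError as exc:
    print("unlimited split failed:", exc)
```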
The biggest drama with DRMAA
The biggest issue we encountered was how Galaxy uses DRMAA. In tests with DRMAA, a single user could submit and check jobs with no problem on our cluster (more on that later). However, Galaxy submits the job as the actual user logged into the website but checks the job status as the Galaxy server user. This caused errors such as tasks being reported as failed when they had actually succeeded in PBS Pro. It turned out that with PBS Pro the pbs-drmaa library could not check the PBS Pro job history and would therefore not find the job once it had finished. The fallback was to use a record of the job created in the user's $HOME directory. The fallback method therefore works for a single user using DRMAA, but if another user tries to check the job status it cannot be used, since that user has no access to the $HOME directory of the user who owns the job.
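The monitoring side of this can be sketched as below: a process that did not submit the job asks a DRMAA session for its status by job ID. This is only an illustration and not how Galaxy does it. The function name poll_job and the "unknown" return value are hypothetical, and it is an assumption that the library signals a job it cannot find with InvalidJobException.

```python
try:
    from drmaa.errors import InvalidJobException
except (ImportError, RuntimeError, OSError):
    # No drmaa package or DRMAA library on this machine: use a stand-in
    # so the sketch can still be read and exercised.
    class InvalidJobException(Exception):
        pass


def poll_job(session, job_id):
    """Return the job state reported by `session`, or "unknown" if not found.

    `session` is expected to behave like an initialised drmaa.Session.
    Treating a missing job as "unknown" rather than "failed" is a choice
    made for this sketch only.
    """
    try:
        return session.jobStatus(job_id)
    except InvalidJobException:
        return "unknown"
```

A caller that gets "unknown" back still has to find the exit status somewhere else, which is the gap the fallback record is meant to fill.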
The first attempt at fixing it was to create a common state directory where DRMAA results are stored by the fallback method. For example, /etc/pbs_drmaa.conf contains:
user_state_dir: "/home/drmaa"
This worked, and we asked the PBS Pro developers how job histories could be handled better from within a C library.
I then went back and re-read the documentation on the pbs-drmaa website, which actually states that the job history does not work for PBS Pro and other job schedulers, and that the PBS Pro server log method should therefore be used. For example, setting pbs_home should let it find the server_logs directory; /etc/pbs_drmaa.conf contains:
pbs_home: "/PBS/",
wait_thread: 1
I then spent time trying to get the server logs to work in the test cases I had (one user submits and another checks) but still could not get it to work; it still seemed to use the PBS statjob function, which does not report finished jobs. I think it is not clear how the use case where DRMAA is used purely to monitor a job, rather than to submit and then monitor it, should work (for example, how long should a finished job remain available before it is lost from any history?).
Therefore, at the moment we are using a common /tmp-style directory to store the DRMAA exit status and other information while more “correct” fixes are investigated.
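For anyone wanting to try the same workaround, a /tmp-style directory can be created along these lines. This is a minimal sketch: the function name make_shared_state_dir is hypothetical, and the mode 1777 (world-writable with the sticky bit) is our reading of "/tmp style", so tighten it if a shared group would do.

```python
import os
import stat
import tempfile


def make_shared_state_dir(path):
    """Create `path` with /tmp-like permissions and return its mode bits."""
    os.makedirs(path, exist_ok=True)
    # 0o1777: anyone may create files, but the sticky bit stops users
    # from deleting or renaming files owned by someone else.
    os.chmod(path, 0o1777)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Demonstrate on a throwaway location rather than a real state directory.
    demo = os.path.join(tempfile.mkdtemp(), "drmaa")
    print(oct(make_shared_state_dir(demo)))
```

Whether the server user can then read the files other users write there still depends on the permissions those files are created with.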
DRMAA as a solution
The idea of DRMAA is a good one, and creating vendor-neutral libraries seems a sensible approach. It would be good to clarify how one user should be able to check other users' jobs. It is also a good example of why these issues should be fed back to the developers, since others have probably experienced similar problems.