Within the ARCCA team we are investigating ways to open up high-performance computing (HPC) to a wider audience. This is not just an issue within ARCCA but a more general trend across HPC organisations. Websites which provide easier access to HPC are called Portals or Gateways. The Extreme Science and Engineering Discovery Environment (XSEDE) has compiled a list of possible Gateways here.
Focus on biosciences
Within Cardiff University, researchers from a field that is new to our HPC services are working with large datasets which contain genetic information. After discussions with the bioscience community, a Portal called Galaxy was investigated to see whether we could interface it with our HPC system. We first downloaded Galaxy at the end of 2013; it has had a number of releases since then, and we should update now that we have a working solution. Galaxy provides a website (and is a web server in itself), so we set it up on a Virtual Machine (VM) which can interface with our HPC system, rather than running a web server directly on the HPC system. This required setting up Apache, including the LDAP authentication used in Cardiff University. Galaxy relies on the Apache LDAP module to know when a user is logged into the site.
Starting up the Portal
Here are the kinds of issues we had to resolve before it was deemed to be working:
Galaxy appeared to have been set up and tested in configurations that were known to work, so error reporting was absent in code locations that do not usually fail. For example:
diff -r a477486bf18e lib/galaxy/jobs/__init__.py
--- a/lib/galaxy/jobs/__init__.py Thu Sep 26 11:02:58 2013 -0400
+++ b/lib/galaxy/jobs/__init__.py Fri Oct 24 11:53:39 2014 +0100
@@ -1445,6 +1445,9 @@
         p = subprocess.Popen( cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE )
         # TODO: log stdout/stderr
         stdout, stderr = p.communicate()
+        if p.returncode != 0:
+            log.debug('%s'%(stdout))
+            log.debug('%s'%(stderr))
         assert p.returncode == 0
 
     def change_ownership_for_run( self ):
The missing logging may well have been a known issue (the code carries a TODO for it), but adding it allowed us to pinpoint where our failures were occurring.
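The idea behind the patch above can be sketched outside Galaxy. The following is a minimal, standalone Python 3 example and not Galaxy's own code: the helper name `run_and_log` and the logger name are hypothetical. It runs a command and, if the return code is non-zero, logs the captured stdout and stderr before reporting the failure.

```python
import logging
import subprocess
import sys

log = logging.getLogger("external_command")  # hypothetical logger name


def run_and_log(cmd):
    """Run cmd (a list of arguments) and log its output if it fails.

    Illustrative helper only; it is not part of Galaxy.
    """
    p = subprocess.Popen(cmd, shell=False,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = p.communicate()
    if p.returncode != 0:
        # Record what the command printed so the failure can be diagnosed.
        log.debug("stdout: %s", stdout.decode(errors="replace"))
        log.debug("stderr: %s", stderr.decode(errors="replace"))
        raise subprocess.CalledProcessError(p.returncode, cmd,
                                            output=stdout, stderr=stderr)
    return stdout
```

Raising an exception that carries the return code and output, rather than using a bare `assert`, is a design choice of this sketch; the patch above keeps the original assertion and only adds the logging.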
DRMAA (Distributed Resource Management Application API) is a standard API which lets the same code submit jobs to schedulers from different vendors; see the DRMAA website for further information. It turned out that a few tweaks to the Galaxy code were required to get it working with our scheduler, for example:
--- a/lib/galaxy/jobs/runners/drmaa.py Thu Sep 26 11:02:58 2013 -0400
+++ b/lib/galaxy/jobs/runners/drmaa.py Fri Oct 24 11:59:06 2014 +0100
@@ -43,6 +43,11 @@
 # - execute the command
 # - take the command's exit code ($?) and write it to a file.
 drm_template = """#!/bin/sh
+
+# Added to find matching Python version between Galaxy server and HPC system.
+module load pyenv
+module load python/2.7.4
+
 GALAXY_LIB="%s"
 if [ "$GALAXY_LIB" != "None" ]; then
     if [ -n "$PYTHONPATH" ]; then
@@ -55,7 +60,9 @@
 %s
 cd %s
 %s
-echo $? > %s
+CC=$?
+echo $CC > %s
+exit $CC
 """
 
 DRMAA_jobTemplate_attributes = [ 'args', 'remoteCommand', 'outputPath', 'errorPath', 'nativeSpecification',
@@ -129,8 +136,8 @@
         jt = self.ds.createJobTemplate()
         jt.remoteCommand = ajs.job_file
         jt.jobName = ajs.job_name
-        jt.outputPath = ":%s" % ajs.output_file
-        jt.errorPath = ":%s" % ajs.error_file
+        jt.outputPath = "localhost:%s" % ajs.output_file
+        jt.errorPath = "localhost:%s" % ajs.error_file
         # Avoid a jt.exitCodePath for now - it's only used when finishing.
         native_spec = job_destination.params.get('nativeSpecification', None)
The key correction was that the DRMAA implementation in our scheduler requires a hostname in the outputPath and errorPath (localhost was sufficient). We also tweaked the job script so that it points to our Python installation, which is controlled by modules. The exit code handling was added for completeness, and the same change was made in pbs.py in the same directory. It is important that script failures are passed on so that we know when they happen.
The DRMAA behaviour also needed to be changed in the following location:
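To make the two changes concrete, here is a small Python sketch under stated assumptions. The helpers `drmaa_path` and `render_job_script`, the command `python tool.py` and the file paths are all hypothetical and are not taken from Galaxy, which builds its script from the `drm_template` string instead. The sketch shows a "host:path" string with an explicit hostname, and a shell wrapper that saves the command's exit code, writes it to a file and then exits with it.

```python
def drmaa_path(path, host="localhost"):
    """Return a DRMAA-style "host:path" string with an explicit host."""
    return "%s:%s" % (host, path)


def render_job_script(command, exit_code_file, modules=()):
    """Build a /bin/sh wrapper that records and propagates the exit code.

    Hypothetical helper for illustration only.
    """
    lines = ["#!/bin/sh"]
    # Load environment modules first so the job sees the expected Python.
    lines.extend("module load %s" % name for name in modules)
    lines.append(command)
    lines.append("CC=$?")                           # save the command's exit code
    lines.append("echo $CC > %s" % exit_code_file)  # write it to a file
    lines.append("exit $CC")                        # and return it to the scheduler
    return "\n".join(lines) + "\n"


# Example values below are placeholders, not real Galaxy paths.
script = render_job_script("python tool.py", "/tmp/job.ec",
                           modules=("pyenv", "python/2.7.4"))
print(script)
print(drmaa_path("/tmp/job.out"))
```

Saving `$?` in a variable matters because `echo` itself succeeds; without the final `exit $CC` the script's own exit status would be that of the `echo`, not of the command that was run.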
--- a/scripts/drmaa_external_runner.py Thu Sep 26 11:02:58 2013 -0400
+++ b/scripts/drmaa_external_runner.py Fri Oct 24 12:09:14 2014 +0100
@@ -30,6 +30,8 @@
 def load_job_template_from_file(jt, filename):
     f = open(filename,'r')
     data = json.load(f)
+    if "jobName" in data:
+        data["jobName"] = data["jobName"][0:15]
     for attr in DRMAA_jobTemplate_attributes:
         if attr in data:
             setattr(jt, attr, data[attr])
@@ -81,6 +83,9 @@
         # Get user's default group and set it to current process to make sure file permissions are inherited correctly
         # Solves issue with permission denied for JSON files
         gid = pwd.getpwuid(uid).pw_gid
+        # Setting HOME stops error reported on remote system due to galaxy user
+        # not available on remote machine (DRMMA library writes to HOME).
+        os.environ["HOME"] = pwd.getpwuid(uid).pw_dir
         os.setgid(gid)
         os.setuid(uid)
     except OSError, e:
It turned out our scheduler did not like job names longer than 15 characters (and would actually refuse to run such jobs). Also, the underlying DRMAA library writes to $HOME, which needs to be set. We also had issues with the pbs-drmaa library, which did not like being used in the way Galaxy was submitting jobs: it got very confused about who the user was, and hence $HOME and permissions were not what the code expected (I have left out the code changes for that library from this post).
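The two fixes in this patch can also be illustrated in isolation. This is a minimal Python 3 sketch, not the patched Galaxy script: the function names, the constant and the example job name are hypothetical, and the 15-character limit is simply the one our scheduler imposed. The `pwd` module is Unix-only.

```python
import os
import pwd  # Unix-only standard library module

MAX_JOB_NAME_LENGTH = 15  # limit imposed by our scheduler; others may differ


def truncate_job_name(data, max_length=MAX_JOB_NAME_LENGTH):
    """Return a copy of the job template dict with jobName shortened."""
    result = dict(data)
    if "jobName" in result:
        result["jobName"] = result["jobName"][:max_length]
    return result


def set_home_for_uid(uid, environ=os.environ):
    """Point HOME at the home directory of the user the job will run as."""
    environ["HOME"] = pwd.getpwuid(uid).pw_dir
    return environ["HOME"]


# Hypothetical job template data, as it might be loaded from a JSON file.
job = truncate_job_name({"jobName": "g1234_very_long_tool_name_user", "args": []})
print(job["jobName"])
```

In the patch, HOME is set before `os.setgid` and `os.setuid` are called, so the DRMAA library sees the target user's home directory once the process has switched user.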
In a Galaxy far far…
Galaxy provides a website which contains the majority of the tools required to perform analyses on bioinformatics data. After some tweaking of the code to fit in with our scheduler and storage systems, it has now been shown to work as expected and has successfully run some jobs. The plan is to roll out the Galaxy website across the bioscience community at Cardiff University and also to look at how best to upgrade Galaxy on a more regular basis.