We recently installed new nodes into our supercomputer. As part of the process we had to run tests to assess the new nodes. Since this was an extension to the existing infrastructure there were limits on how many high-speed links (InfiniBand (IB) connections) could be wired in, and the intended use of the nodes does not require every node to talk to every other node (i.e. it is not a tightly coupled topology across all the nodes).
Comparison with existing infrastructure
The new nodes use InfiniBand at FDR speeds, whilst our existing nodes and network infrastructure use QDR. Some background: a node is similar to a PC, but 18 nodes are contained within a chassis (CMC) in blade format. Our current MPI partition has 18 IB connections coming out of each CMC, so each node has the throughput required. The new nodes only have 6 IB connections per chassis. During the testing of the new nodes it turned out we can increase the performance of the InfiniBand traffic for jobs that saturate the MPI communications between nodes if the jobs are contained within a single CMC, since those nodes share a common IB backplane (and at FDR speeds).
Implementing this in a job scheduler
To achieve this with our scheduler, PBS Pro, we used a feature called “placement sets”, which lets the scheduler group nodes by resources that describe the topology of the connections between them. We do not expect to need to run jobs over more than 18 nodes, so this solution should work for our case. We will have to monitor the use of the nodes over time.
We first had to add a new resource to the scheduler using:
HWcmc type=string_array flag=h
Then, after restarting the server so that it loads the new resource from its resourcedef file, we added the resource to each node:
set node raven300 resources_available.HWcmc = HWcmc13
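With 18 nodes per chassis this tagging quickly becomes repetitive, so a small script helps. Below is a minimal sketch that prints the qmgr commands for one chassis; the node range (raven289–raven306) and the CMC number are hypothetical examples, not our real layout, and should be taken from your own naming plan.

```shell
#!/bin/sh
# Sketch: generate the qmgr commands that tag every node in one chassis.
# The node numbers and CMC number below are illustrative only.
cmc=13
first=289   # first node in this CMC (hypothetical)
last=306    # last node in this CMC (18 nodes per chassis)
for n in $(seq "$first" "$last"); do
    echo "set node raven${n} resources_available.HWcmc = HWcmc${cmc}"
done
```

Review the printed commands, then pipe the script's output into `qmgr` to apply them.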
The value corresponds to the CMC number given in our overall design plans, for consistent naming. Finally, for the queue which will use this option we set:
set queue workq_haswell node_group_key = HWcmc
This tells the scheduler to look for nodes with matching values of HWcmc rather than spreading a job out across the cluster.
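With node_group_key set on the queue, the grouping happens without any action from the user. PBS Pro also lets a job request grouping explicitly via the place directive; a hypothetical submission for a 6-node MPI job on that queue might look like the following (the chunk count, core count, and script name are made-up examples, not our actual job):

```shell
# Hypothetical submission: request 6 chunks and ask that they all share
# one HWcmc value, i.e. land in a single chassis.
qsub -q workq_haswell -l select=6:ncpus=16 -l place=group=HWcmc mpi_job.sh
```

This is only a sketch of the syntax; since it needs a live PBS server it cannot be run standalone, and real select statements will depend on the job's actual resource needs.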
Beware of the network topology
This has hopefully shown that even when the network topology looks like it may be a limitation, we can exploit the internal design through the job scheduler to get better performance overall. This may just be a special case for our type of work, but it should prove useful in the long-term running of the new hardware.