DeveloperHPC

Positive performance with C++

1 October 2014

Recently a researcher came to us with a performance issue in their code.  The code was written in C++ and was taking a very long time to run, long enough to hit the limit on how long a job may run within our scheduler.

Profile of the culprit

The first thing to do when given a task like this is to profile the code.  This code was quite interesting because of its use of newer C++ features.  These caused issues with the Intel compiler, and a newer GNU compiler had to be built, since the version we had installed was an older one.  Because the code used OpenMP with the GNU compiler, which seemed to cause problems for the Intel suite of tools we had installed, the basic GNU profiling tool, gprof, was used instead.  This involves compiling and linking with the GNU compiler using the profiling option -pg alongside the debugging-symbol option -g. For example:

$ g++ -g -pg hello.cc -o hello

When the program is run, a file called gmon.out is created containing the profile information. A profiled build can take a lot longer to run, so be prepared for a long wait. The gmon.out file is then fed to the gprof utility (whose output is best piped to a pager), for example:

$ gprof hello gmon.out | less
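
To make this concrete, a minimal hello.cc along the following lines could be used to try the workflow end to end; the slow_sum function here is purely illustrative and is only there to give the profiler something measurable to attribute time to.

#include <cstdio>

// Hypothetical hot spot: deliberately does enough work to show up in the profile.
static double slow_sum(int n) {
  double total = 0.0;
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      total += static_cast<double>(i) * j;
    }
  }
  return total;
}

int main() {
  std::printf("result: %f\n", slow_sum(10000));
  return 0;
}

Compiling this with the g++ command above, running ./hello once to produce gmon.out, and then running gprof as shown gives a profile dominated by slow_sum.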

Finding and fixing the bottleneck

The profile listing displays a long list of functions and the amount of time spent in each. Hopefully a particular function will stand out as using a large proportion of the execution time.
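
For the illustrative hello.cc above, the top of the flat profile printed by gprof would look roughly like the following; the numbers are invented for illustration and will differ from machine to machine, but the layout of the columns is what matters.

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 99.9       0.85      0.85        1   850.00   850.00  slow_sum(int)

The % time and self seconds columns are the ones to look at first: they show how much of the run was spent directly inside each function.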

In this case the function causing the performance problem was a simple one looping over a vector, something like:

std::shared_ptr<std::vector<double>> costs;  // flattened numberOfNodes x numberOfNodes cost matrix
for (int k = 0; k < numberOfNodes; k++) {
  for (int i = 0; i < numberOfNodes; i++) {
    for (int j = 0; j < numberOfNodes; j++) {
      double cost;
      cost = costs->at(i*numberOfNodes + k) + costs->at(k*numberOfNodes + j);
      if (cost < costs->at(i*numberOfNodes + j)) {
        costs->at(i*numberOfNodes + j) = cost;
      }
    }
  }
}

As you might be able to see, the loop is quite heavily nested. The profiler picked up that the locking involved in the shared_ptr accesses inside the loop was taking up quite a large proportion of the running time. It turned out there was quite a simple change: if we are sure we do not need the shared_ptr features at this stage of execution, we can instead do this:

std::shared_ptr<std::vector<double>> costs;  // flattened numberOfNodes x numberOfNodes cost matrix
// Use the raw pointer to the underlying vector data instead of going through the shared_ptr.
double *costs_ptr = costs->data();
for (int k = 0; k < numberOfNodes; k++) {
  for (int i = 0; i < numberOfNodes; i++) {
    for (int j = 0; j < numberOfNodes; j++) {
      double cost;
      cost = costs_ptr[i*numberOfNodes + k] + costs_ptr[k*numberOfNodes + j];
      if (cost < costs_ptr[i*numberOfNodes + j]) {
        costs_ptr[i*numberOfNodes + j] = cost;
      }
    }
  }
}

Now when the loop accesses the costs vector it does so through the raw pointer, and the cost of the locking is removed. The speedup was around 4x, which is quite good. Other optimisations were possible, but this was the main issue affecting the code.
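
If you want to see the effect of this change in isolation, a small self-contained benchmark along the lines of the loop above can be put together. Everything here is a sketch: the problem size, the uniform initial costs and the function names are made up for illustration and do not come from the researcher's code.

#include <chrono>
#include <cstdio>
#include <memory>
#include <vector>

// Relax the flattened numberOfNodes x numberOfNodes cost matrix,
// accessing every element through the shared_ptr and the bounds-checked at().
static void relaxViaSharedPtr(const std::shared_ptr<std::vector<double>> &costs,
                              int numberOfNodes) {
  for (int k = 0; k < numberOfNodes; k++) {
    for (int i = 0; i < numberOfNodes; i++) {
      for (int j = 0; j < numberOfNodes; j++) {
        double cost = costs->at(i*numberOfNodes + k) + costs->at(k*numberOfNodes + j);
        if (cost < costs->at(i*numberOfNodes + j)) {
          costs->at(i*numberOfNodes + j) = cost;
        }
      }
    }
  }
}

// The same loop, but working on the raw pointer to the underlying data.
static void relaxViaRawPointer(const std::shared_ptr<std::vector<double>> &costs,
                               int numberOfNodes) {
  double *costs_ptr = costs->data();
  for (int k = 0; k < numberOfNodes; k++) {
    for (int i = 0; i < numberOfNodes; i++) {
      for (int j = 0; j < numberOfNodes; j++) {
        double cost = costs_ptr[i*numberOfNodes + k] + costs_ptr[k*numberOfNodes + j];
        if (cost < costs_ptr[i*numberOfNodes + j]) {
          costs_ptr[i*numberOfNodes + j] = cost;
        }
      }
    }
  }
}

int main() {
  const int numberOfNodes = 400;  // hypothetical problem size
  auto costs = std::make_shared<std::vector<double>>(numberOfNodes * numberOfNodes, 1.0);

  auto t0 = std::chrono::steady_clock::now();
  relaxViaSharedPtr(costs, numberOfNodes);
  auto t1 = std::chrono::steady_clock::now();
  relaxViaRawPointer(costs, numberOfNodes);
  auto t2 = std::chrono::steady_clock::now();

  std::printf("via shared_ptr/at(): %.3f s\n", std::chrono::duration<double>(t1 - t0).count());
  std::printf("via raw pointer:     %.3f s\n", std::chrono::duration<double>(t2 - t1).count());
  return 0;
}

The exact ratio between the two timings will depend on the compiler, the optimisation level and the hardware, so treat it as a way of exploring the behaviour rather than a reproduction of the 4x figure above.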