Floating in numbers

Just like an iceberg, floating point numbers can have hidden danger lurking just under the surface.

A recent email through a mailing list reminded me about the subtlties of handling floating point numbers. Floating point numbers are represented by mantissa*(base**exponent), see floating point numbers.

There are many occassions where programmers may forget that floating point numbers do not behave the way you would expect them to. Understanding the behaviour of floating point numbers can have a huge impact on writing computer code, especially within science. I sometimes hear the following statements about floating point numbers and why it should not matter:

1. I do not need high precision
2. I always use double precision reals to store calculations
3. If my work gives a result then that is all that matters

Lets look at each of these statements.

“I do not need high precision”

Floating point could be considered to do with precision, however not in the sense that most people think. Precision is always relative and not to do about how many decimal places you want to store. Lets take some example Python code where we may want to sum the numbers between 1 and 8192 which on the face of it should not require high precision.

#!/usr/bin/env python
import numpy
sum = numpy.float32(0.0)
# N.B. range requires [1,8193) (or 1-8192)
for x in range(1,8193):
sum += numpy.float32(x)

print "Sum = %8.0f" % sum

This would seem to be fairly straight forward summation. Running the code:

Sum = 33557328

If I were to calculate this with the formula n*(n+1)/2 (see trick by Gauss) then the sum should be 33558528. Why the difference?

The difference can be attributed to the operation of addition that has to make the numbers you are adding have the same exponent and therefore lose precision in the mantissa, eventually the smaller number has to lose so much of the mantissa to have the same exponent then precision is lost.

“I always use double precision reals to store calculations”

You may have noticed I used numpy.float32 type in the previous example. We could calculate what summation would require the numpy.float64 type to lose precision (will leave reader that exercise). Would a double precision which uses twice the storage mean I do not need to worry about it?

In essence it is usually recommended to use double precision floating point numbers but it does not mean you can forget about handling floating point numbers. For example (1+x)-x should be 1 but due to artificially pushing the calculation to lose precision we get:

>>> (numpy.float64(1.0)+numpy.float64(2**53))-numpy.float64(2**53)
0.0

What about 1+(x-x)? This should be fine since we operate on similar sized numbers.

>>> numpy.float64(1.0)+(numpy.float64(2**53)-numpy.float64(2**53))
1.0

So we cannot assume associative relationship in the calculation of floating point numbers. This has an impact on parallel computing where you might assume a+b+c+d could be divided between tasks into a+b and c+d but when we go to sum the result it could be different to the 1 task version.

To combat getting different answers with parallel operations such as summations, there is a nice trick by Kahan which keeps track of the erros and corrects the calculation and therefore gives you more accuracy in your results, see Kahan summation algorithm. This method can be parallised with MPI using custom-defined operators within the reduce step and store the calculation within a COMPLEX type. See Accurate numerics in parallel applications

“If my work gives a result then that is all that matters”

Within science it is important to be able to reproduce and provide some level of confidence about your results, it might be true if the answer gives a plausible number then it can still be used, however you need to know what impact this result will have. A number of examples of real world failures due to not handling float point numbers is present, including Patriot Missle failure and explosion of Ariane 5 rocket.

For more see Some disasters caused by numerical errors

Summary

Every computer programmer should understand about floating point numbers, or at least know the limits of the calculations. This allows there to be greater confidence in your results and reduce the wasted time chasing errors which are due to floating point behaviour.