Accelerating Algorithms: Considerations in Design, Algorithm Choice and Implementation

If you are trying to make your algorithms run faster, it is worth reviewing some important points on design, algorithm choice and implementation.

By Tom Radcliffe.

The pursuit of speed is one of the few constants in computing, and it is driven by two things: ever-increasing amounts of data, and limited hardware resources. Today, in the era of Big Data, one of the most effective design decisions is to work with a well-chosen subset of the data rather than slavishly using every available data point. This was the trick behind pseudo-correlation, an ultra-fast image registration algorithm: by using a small sub-set of pixels to estimate a cross-correlation integral, it was possible to speed up registration by orders of magnitude. A similar approach was used in the mutual information algorithm several years later.

The overarching design question should be: “Do I really need to compute that?” There may be something that is strongly correlated with the result you care about, but which can be computed much more quickly.
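
For illustration, here is a minimal sketch of that idea in Python (the names and sample size are mine, not taken from the pseudo-correlation work): a statistic estimated from a small random sample is often close enough to the full computation, at a fraction of the cost.

    import random

    # A minimal sketch of the "estimate from a subset" idea: estimate the
    # mean intensity of a large "image" from a small random sample of
    # pixels instead of visiting every pixel.
    random.seed(42)
    pixels = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

    full_mean = sum(pixels) / len(pixels)    # touches every data point

    sample = random.sample(pixels, 1_000)    # touches 0.1% of them
    sample_mean = sum(sample) / len(sample)

    print(f"full mean:    {full_mean:.2f}")
    print(f"sampled mean: {sample_mean:.2f}")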

ALGORITHM CHOICE

The order of an algorithm gets a lot of academic attention, but in practice it is often surprisingly unimportant. The reasons for this are:

  1. Problems often come with a given scale, so the asymptotic scaling behaviour of the algorithm may not matter in practice. If something runs well on ten million data points and we never see datasets larger than that, then its performance on a hundred million is irrelevant.
  2. Leading factors matter: an O(N**2) algorithm may have a much smaller leading (constant) factor than an equivalent O(N*log(N)) algorithm because it is simpler, so at realistic problem sizes the simpler algorithm can win.
  3. Speaking of complexity, algorithms with better big-O performance are often more complex, making them more fragile, difficult to debug, and difficult to maintain.

The most famous case of fragility is probably quicksort, which with a naive pivot choice degrades from O(N*log(N)) to O(N**2) when the input is already sorted. Fortunately this is an easy case to spot, but more complex algorithms with good average performance can still fail badly on special cases, and debugging this behaviour can be complex and time-consuming. Because such failures are typically rare, it is also not uncommon to encounter them only after deployment, and be faced with the weird situation where “sometimes things run slowly” for no readily apparent reason.
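
A deliberately naive sketch (first-element pivot, not any production quicksort implementation) shows where that fragility comes from:

    import random

    def naive_quicksort(data):
        # Deliberately naive: always pivots on the first element. On an
        # already-sorted input every partition is maximally unbalanced,
        # so the work degrades to O(N**2) and the recursion depth grows
        # towards N.
        if len(data) <= 1:
            return data
        pivot, rest = data[0], data[1:]
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return naive_quicksort(left) + [pivot] + naive_quicksort(right)

    random_input = [random.random() for _ in range(500)]
    sorted_input = sorted(random_input)

    assert naive_quicksort(random_input) == sorted_input  # balanced, fast
    assert naive_quicksort(sorted_input) == sorted_input  # unbalanced, slow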

In algorithm choice it is best to implement the simplest thing first, then quantify its performance on realistic datasets and decide whether a more complex algorithm is actually required. More often than not it won’t be. I once worked on some mesh registration code that used KD trees to match points. Because the datasets were reasonably small, a simple linear search with a few easily understood heuristics comfortably out-performed the nominally much better KD tree implementation, which was so fragile, complex and unmaintainable that I had fallen back on something simpler just to understand what was going on.
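
As a sketch of what “the simplest thing first” can look like for point matching (illustrative only, not the actual mesh registration code):

    import math

    def nearest_point(query, points):
        # Brute-force nearest neighbour: O(N) per query, no tree to build,
        # no balancing or traversal logic to debug.
        best, best_d2 = None, float("inf")
        qx, qy, qz = query
        for (x, y, z) in points:
            d2 = (x - qx) ** 2 + (y - qy) ** 2 + (z - qz) ** 2
            if d2 < best_d2:
                best, best_d2 = (x, y, z), d2
        return best, math.sqrt(best_d2)

    # For a few thousand points per mesh this is typically fast enough,
    # and its behaviour is trivially predictable on any input.
    points = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0), (3.0, 1.0, 0.5)]
    print(nearest_point((0.9, 1.8, 2.1), points))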

IMPLEMENTATION DETAILS

An algorithm is not its implementation. Even relatively simple and efficient algorithms can be implemented badly, and a poor implementation can be substantially slower than it needs to be. Unnecessary function calls are one common culprit; using recursion where a loop would do can also hurt speed.
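
As a rough illustration of function-call overhead (the helper names are mine, and timings will vary by machine):

    import timeit

    def square(x):
        return x * x

    def sum_squares_with_calls(data):
        # One extra Python-level function call per element.
        total = 0
        for x in data:
            total += square(x)
        return total

    def sum_squares_inline(data):
        # The same arithmetic with the call inlined into the loop body.
        total = 0
        for x in data:
            total += x * x
        return total

    data = list(range(1_000_000))
    print(timeit.timeit(lambda: sum_squares_with_calls(data), number=10))
    print(timeit.timeit(lambda: sum_squares_inline(data), number=10))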

In many cases, high-performance code is written in a relatively low-level language like C, C++ or even Fortran, and called from a high-level language. How these calls are ordered and organized can make a big difference to performance: calling across a language boundary is generally an expensive operation, so interfaces should be used in a way that minimizes the number of boundary crossings.
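
A small sketch of the same principle within Python itself, assuming NumPy is installed (NumPy’s compiled loops play the role of the low-level language here): the first version crosses the Python/C boundary once per element, the second crosses it once per array.

    import math

    import numpy as np

    data = np.random.rand(1_000_000)

    # One boundary crossing per element: the loop runs in Python.
    total_scalar = 0.0
    for x in data:
        total_scalar += math.sqrt(x)

    # One boundary crossing per array: the loop runs in compiled code.
    total_vector = float(np.sqrt(data).sum())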

High-level languages often include features that speed things up. For example, Python’s built-in sum() and map() are both much faster on lists than explicit loops. Looping over 100,000,000 integers and summing them takes 4.4 seconds on my desktop machine, but only 0.63 seconds if I use sum() on the list. Furthermore, lazy sequences (xrange rather than range in Python 2; in Python 3, range() is already lazy) are generally faster and use less memory than functions that build an entire list up front.
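
A smaller-scale version of that comparison (absolute timings depend on the machine, so expect different numbers):

    import timeit

    data = list(range(1_000_000))

    def loop_sum(values):
        total = 0
        for v in values:
            total += v
        return total

    # The built-in sum() runs its loop in C and is typically several
    # times faster than the explicit Python loop.
    print(timeit.timeit(lambda: loop_sum(data), number=10))
    print(timeit.timeit(lambda: sum(data), number=10))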

The map() function is also much faster than looping: looping over a list of 100,000,000 integers and taking the square root of each takes 10 seconds on my desktop, but the same computation using map(math.sqrt, lstData) takes about half that time.
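
A scaled-down version of that measurement (note that in Python 3, map() is lazy, so list() is needed to force the work and make the comparison fair):

    import math
    import timeit

    lstData = list(range(1, 1_000_001))

    def loop_sqrt(values):
        # Explicit loop: one Python-level iteration and append per element.
        result = []
        for v in values:
            result.append(math.sqrt(v))
        return result

    print(timeit.timeit(lambda: loop_sqrt(lstData), number=10))
    print(timeit.timeit(lambda: list(map(math.sqrt, lstData)), number=10))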

LOWER LEVEL APPROACHES

There are even more powerful techniques available in Python, such as using NumPy, Cython and Intel's Math Kernel Library (MKL), which is available in ActiveState’s ActivePython. There may be constraints that prevent the use of one or more of these techniques for a given problem, but they are worth keeping in mind as part of your “efficiency arsenal”. Just be sure to quantify your algorithm’s performance first!
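
For example, a single NumPy call such as a matrix product dispatches to whatever BLAS library NumPy was built against (MKL in distributions such as ActivePython), so the heavy lifting happens in optimized native code. Whether MKL is actually used depends on how your NumPy was built:

    import numpy as np

    # Shows which BLAS/LAPACK libraries this NumPy build links against.
    np.show_config()

    # One call; the multiplication itself runs in the underlying BLAS.
    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)
    c = a @ b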

To learn more about accelerating algorithms and get more insights into Intel’s MKL optimizations, watch my webinar, "Accelerating Your Algorithms in Production with Python and Intel MKL", co-hosted with Sergey Maidanov, Software Engineering Manager from Intel.

Original. Reposted with permission.
