How C++14 and C++17 help to write faster (and better) code. Real world examples

Writing high performance code is always a difficult task. Pure algorithms are not always a good fit for the real world architectures.
Once having begun to speed up all these pure algorithms, we quickly find that some implementation is pretty fast on one architecture, but terrible slow on another while second implementation outperform the first one in some contexts losing speed in all the rest.

Numerous big and small optimizations for each of supported architectures can quickly bloat our code and waste our time.
Often, we’re limited to just two options: either beautiful and slow or fast and ugly.

Perhaps you have asked about modern compiler optimizations?

Nowadays compilers can do many things, but definitely not all. For example, all the algorithms described in this article can’t be fairly optimized and vectorized because of its complexity.

In this article I’ll show when C++17 and C++14 can help to write fast, compact and well-structured code.

Fast Fourier Transform that we can make even faster

The Fast Fourier Transform (FFT) is a central algorithm in Digital Signal Processing and has a wide application in many other tasks.

In a few words, FFT transforms data (signal) from the time domain
(each sample represents the signal at some point in time)
to the frequency domain
(each sample represents the power of some frequency)

The performance of an implementation of the Fast Fourier Transform often greatly affects the overall performance of an application due to the large number of calls to FFT.

Some parts of FFT implementation in the KFR Framework were greatly improved by using C++14.

Example 1. Specialization

To achieve maximum speed, FFT must be specialized for some small sizes, such as 2, 4, 8, 16, 32, 64, 128 and 256.

This code example shows how to select specialization based on a size, known only at runtime.

C++14 version (source code on GitHub):

Hereinafter, I use cswitch, cif and some other functions from
CoMeta C++14 Metaprogramming library (MIT license)
which was written during the development of the KFR Framework.

The above code is equivalent to the following code:

C++03 or C++11 version:

There is another way to do it in C++11: extract code to a new functor, make this functor’s call operator generic (to be able to get compile-time values) and pass all local variables to it. But this introduces undesired complexity and anyway is not as easy as calling cswitch in C++14.

cswitch function

It can be used when a value known at runtime must appear in a template argument list. There is only one way to do it internally: compare the value to the every value from the predefined list and call the corresponding function template. But C++14 can help in this situation and we aren’t required to write all these comparisons manually. To be concrete, generic lambdas is the feature that make it working.

Here is cswitch source code (GitHub):

As you can see, fallback function is optional.

This code will be expanded into a sequence of comparisons.

swallow is a struct with constructor that accept initializer_list.
List-initialization is used to guarantee that arguments are evaluated sequentially.
Upcoming C++17 will introduce fold-expressions that can be used to achieve the same effect in a much more elegant way.

SIMD and C++

FFT consists of small routines called butterflies which have to be highly optimized in terms of usage of SIMD registers. Let’s count all the specializations needed for the full FFT algorithm.

  1. Firstly, we have two types: float and double,
    so we start from 2 specializations.
  2. Code must be optimized for SSE.x and AVX.x
    which have different SIMD register sizes: 4.
  3. x86 has 8 SIMD registers (even for AVX) while x86-64 has 16.
    In x86-64 we can double the amount of values processed
    in one step using two registers instead of one.
    We can not miss this opportunity, so let’s double all our specializations.
    Now we have 8 specializations.
  4. In order to be able to do both direct and inverse FFT
    using the same precalculated data
    (inverse FFT is an exact same algorithm as a direct FFT
    except of conjugation of all coefficients
    ),
    we have to modify all our butterflies: 16.
  5. One of FFT stages always multiplies input data by 1.0,
    therefore we can (and must, if we want to get high performance)
    completely remove the multiplication.
    The number of specializations became 32.
  6. Data can be unaligned or aligned.
    Access to aligned data using unaligned instructions will be slow on SSE.
    Access to unaligned data using aligned instructions will always lead to segfault.
    Nothing remains except to make 48 specializations.

Once again: 48 different specializations for a radix-4 butterfly. Some of these specializations are redundant, but an overall count of optimized implementations goes far beyond 20.
In C we would have to write and maintain all these functions.

What if I say that all these specializations can have exactly the same source code in C++14? All differences that are needed for achieving the best results in every configuration can be passed as template arguments.

Further, the butterfly operates on the type vec<T, width>
which hides all implementation details and has unified public interface.

C++ gives us a lot more advantages over a plain C in reusing the code.
But C++14 goes further.

More examples

Several functions that benefit from C++11 or C++14.

counter

Probably the simplest function in the KFR Framework.

counter is a generator that outputs values sequentially starting from 0 (GitHub).

cinput_t indicates that we read from an expression rather than write to it. This type is used to distinct what we want to do when the same object implements both input and output operator()
index gets its value from the for-loop variable when the template expression is executed.
x controls the result type of this function.
lambda encapsulates lambda to an expression object.

Without generic lambdas we would have to write a functor with generic operator().

sequence

This one outputs given values sequentially and repeats from the beginning when it reaches the end.

This function utilizes Initialized lambda captures, Return type deduction for normal functions and Generic lambdas, all from C++14 standard.

Conclusion

Modern standards of C++ are more than just a revision of the syntax or new functions and types.

They have many practical implementations, including a software where the high-performance is the main goal and can help with explicit optimization and vectorization in order to achieve the best speed results without code bloat.

With modern standards of C++ code can be fast and beautiful, scalable and maintable at the same time.

All these techniques are widely used in the KFR C++ DSP framework (GPL/Commercial) for template expressions.

Dan Levin, KFRLIB.COM

15 thoughts on “How C++14 and C++17 help to write faster (and better) code. Real world examples”

  1. For the cswitch, why not use a lookup table instead (which should be be made by the compiler with the switch/case too)?
    Because if I’m not wrong in the expansion, all of the variadic values are tested whatever the previous result.

  2. By example something like that:

    1. Interesting idea. But lookupTable should contain pointers to Function::operator()<cval_t<T, vs>> not the functions themselves.

  3. I’m confused… the title mentions C++14 and C++17 but the article says “… C++11 and C++14 can help to…”. Which is it?

  4. Would you care to explain why void() is required in cswitch:

    (function(cval<T, vs>), void(), true)

    I understand that we use comma to discard the nonexistent result of function() call and return true to mark the sequence as already handled, but why do we have to introduce void() there?

    1. If a user defined comma operator exists that accepts the function result type and bool, then it will be called in the construction (fn(...), true).
      void() is used to prevent this because no one can define operator,(..., void) or operator,(void, bool).

Leave a Reply

Your email address will not be published. Required fields are marked *