Writing high performance code is always a difficult task. Pure algorithms are not always a good fit for the real world architectures.

Once having begun to speed up all these pure algorithms, we quickly find that some implementation is pretty fast on one architecture, but terrible slow on another while second implementation outperform the first one in some contexts losing speed in all the rest.

Numerous big and small optimizations for each of supported architectures can quickly bloat our code and waste our time.

Often, we’re limited to just two options: either beautiful and slow or fast and ugly.

Perhaps you have asked about modern compiler optimizations?

Nowadays compilers can do many things, but definitely not all. For example, all the algorithms described in this article can’t be fairly optimized and vectorized because of its complexity.

In this article I’ll show when C++17 and C++14 can help to write fast, compact and well-structured code.

## Fast Fourier Transform that we can make even faster

The Fast Fourier Transform (FFT) is a central algorithm in Digital Signal Processing and has a wide application in many other tasks.

In a few words, FFT transforms data (signal) from the time domain

(each sample represents the signal at some point in time)

to the frequency domain

(each sample represents the power of some frequency)

The performance of an implementation of the Fast Fourier Transform often greatly affects the overall performance of an application due to the large number of calls to FFT.

Some parts of FFT implementation in the *KFR Framework* were greatly improved by using C++14.

### Example 1. Specialization

To achieve maximum speed, FFT must be specialized for some small sizes, such as 2, 4, 8, 16, 32, 64, 128 and 256.

This code example shows how to select specialization based on a size, known only at runtime.

C++14 version (source code on GitHub):

1 2 3 4 5 6 7 8 9 10 11 12 |
cswitch(csizes<1, 2, 3, 4, 5, 6, 7, 8>, log2n, [&](auto log2n) { // if log2n was found in the list, create specialized stage add_stage<internal::fft_specialization_t<T, val_of(log2n), false>::template type>(size, type); }, [&]() { //if log2n wasn't found in the list, create generic stage // generic stage has two specializations for even and odd log2n, so select it cswitch(cfalse_true, is_even(log2n), [&](auto is_even) { make_fft(size, type, is_even, ctrue); add_stage<internal::fft_reorder_stage_impl_t<T, val_of(is_even)>::template type>(size, type); }); }); |

Hereinafter, I use

`cswitch`

,`cif`

and some other functions from

CoMeta C++14 Metaprogramming library(MIT license)

which was written during the development of theKFR Framework.

The above code is equivalent to the following code:

C++03 or C++11 version:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
switch(log2n) { case 1: add_stage<internal::fft_specialization_t<T, 1, false>::template type>(size, type); break; case 2: add_stage<internal::fft_specialization_t<T, 2, false>::template type>(size, type); break; case 3: add_stage<internal::fft_specialization_t<T, 3, false>::template type>(size, type); break; case 4: add_stage<internal::fft_specialization_t<T, 4, false>::template type>(size, type); break; case 5: add_stage<internal::fft_specialization_t<T, 5, false>::template type>(size, type); break; case 6: add_stage<internal::fft_specialization_t<T, 6, false>::template type>(size, type); break; case 7: add_stage<internal::fft_specialization_t<T, 7, false>::template type>(size, type); break; case 8: add_stage<internal::fft_specialization_t<T, 8, false>::template type>(size, type); break; default: if(is_even(log2n)) { make_fft(size, type, cbool<true>, ctrue); add_stage<internal::fft_reorder_stage_impl_t<T, true>::template type>(size, type); } else { make_fft(size, type, cbool<false>, ctrue); add_stage<internal::fft_reorder_stage_impl_t<T, false>::template type>(size, type); } } |

There is another way to do it in C++11: extract code to a new functor, make this functor’s call operator generic (to be able to get compile-time values) and pass all local variables to it. But this introduces undesired complexity and anyway is not as easy as calling `cswitch`

in C++14.

`cswitch`

function

It can be used when a value known at runtime must appear in a template argument list. There is only one way to do it internally: compare the value to the every value from the predefined list and call the corresponding function template. But C++14 can help in this situation and we aren’t required to write all these comparisons manually. To be concrete, generic lambdas is the feature that make it working.

Here is `cswitch`

source code (GitHub):

1 2 3 4 5 6 7 8 9 |
template <typename T, T... vs, typename U, typename Function, typename Fallback = DoNothing> void cswitch(cvals_t<T, vs...>, const U& value, Function&& function, Fallback&& fallback = Fallback()) { bool result = false; swallow{ (result = result || ((vs == value) ? (function(cval<T, vs>), void(), true) : false), void(), 0)... }; if(!result) fallback(); } |

As you can see, fallback function is optional.

This code will be expanded into a sequence of comparisons.

1 2 3 4 5 6 7 8 |
bool result = false; result = result || ((vs == value) ? (function(cval<T, vs>), void(), true) : false); result = result || ((vs == value) ? (function(cval<T, vs>), void(), true) : false); ... result = result || ((vs == value) ? (function(cval<T, vs>), void(), true) : false); if(!result) fallback(); |

`swallow`

is a struct with constructor that accept `initializer_list`

.

List-initialization is used to guarantee that arguments are evaluated sequentially.

Upcoming C++17 will introduce *fold-expressions* that can be used to achieve the same effect in a much more elegant way.

### SIMD and C++

FFT consists of small routines called *butterflies* which have to be highly optimized in terms of usage of SIMD registers. Let’s count all the specializations needed for the full FFT algorithm.

- Firstly, we have two types:
`float`

and`double`

,

so we start from`2`

specializations. - Code must be optimized for SSE.x and AVX.x

which have different SIMD register sizes:`4`

. - x86 has 8 SIMD registers (even for AVX) while x86-64 has 16.

In x86-64 we can double the amount of values processed

in one step using two registers instead of one.

We can not miss this opportunity, so let’s double all our specializations.

Now we have`8`

specializations. - In order to be able to do both direct and inverse FFT

using the same precalculated data

(*inverse FFT is an exact same algorithm as a direct FFT*),

except of conjugation of all coefficients

we have to modify all our butterflies:`16`

. - One of FFT stages always multiplies input data by 1.0,

therefore we can (and must, if we want to get high performance)

completely remove the multiplication.

The number of specializations became`32`

. - Data can be unaligned or aligned.

Access to aligned data using unaligned instructions will be slow on SSE.

Access to unaligned data using aligned instructions will always lead to segfault.

Nothing remains except to make`48`

specializations.

Once again: **48** different specializations for a radix-4 butterfly. Some of these specializations are redundant, but an overall count of optimized implementations goes far beyond 20.

In C we would have to write and maintain all these functions.

What if I say that all these specializations can have exactly the same source code in C++14? All differences that are needed for achieving the best results in every configuration can be passed as template arguments.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 |
template <size_t width, bool use_br2, bool inverse, bool aligned, typename T> inline void radix4_body( size_t N, // fft size csize_t<width>, // SIMD width cfalse_t, cfalse_t, cfalse_t, cbool_t<use_br2>, // use bit-reversal permutation cbool_t<inverse>, // true is inverse FFT cbool_t<aligned>, // true if aligned access complex<T>* out, const complex<T>* in, const complex<T>* twiddle) { const size_t N4 = N / 4; cvec<T, width> w1, w2, w3; cvec<T, width> sum02, sum13, diff02, diff13; cvec<T, width> a0, a1, a2, a3; a0 = cread<width, aligned>(in + 0); a2 = cread<width, aligned>(in + N4 * 2); sum02 = a0 + a2; a1 = cread<width, aligned>(in + N4); a3 = cread<width, aligned>(in + N4 * 3); sum13 = a1 + a3; cwrite<width, aligned>(out, sum02 + sum13); w2 = sum02 - sum13; cwrite<width, aligned>( out + N4 * (use_br2 ? 1 : 2), radix4_apply_twiddle(csize<width>, cfalse, cbool<inverse>, w2, cread<width, true>(twiddle + width))); diff02 = a0 - a2; diff13 = a1 - a3; if (inverse) { diff13 = (diff13 ^ broadcast<width * 2, T>(T(), -T())); diff13 = swap<2>(diff13); } else { diff13 = swap<2>(diff13); diff13 = (diff13 ^ broadcast<width * 2, T>(T(), -T())); } w1 = diff02 + diff13; cwrite<width, aligned>( out + N4 * (use_br2 ? 2 : 1), radix4_apply_twiddle(csize<width>, cfalse, cbool<inverse>, w1, cread<width, true>(twiddle + 0))); w3 = diff02 - diff13; cwrite<width, aligned>(out + N4 * 3, radix4_apply_twiddle(csize<width>, cfalse, cbool<inverse>, w3, cread<width, true>(twiddle + width * 2))); } |

Further, the butterfly operates on the type `vec<T, width>`

which hides all implementation details and has unified public interface.

C++ gives us a lot more advantages over a plain C in reusing the code.

But C++14 goes further.

## More examples

Several functions that benefit from C++11 or C++14.

### counter

Probably the simplest function in the *KFR Framework*.

`counter`

is a *generator* that outputs values sequentially starting from 0 (GitHub).

1 2 3 4 5 6 7 |
inline auto counter() { return lambda([](cinput_t, size_t index, auto x) { return enumerate(x) + index; }); } |

`cinput_t`

indicates that we read from an expression rather than write to it. This type is used to distinct what we want to do when the same object implements both input and output `operator()`

`index`

gets its value from the for-loop variable when the template expression is executed.

`x`

controls the result type of this function.

`lambda`

encapsulates lambda to an expression object.

Without generic lambdas we would have to write a functor with generic `operator()`

.

### sequence

This one outputs given values sequentially and repeats from the beginning when it reaches the end.

This function utilizes *Initialized lambda captures*, *Return type deduction for normal functions* and *Generic lambdas*, all from C++14 standard.

1 2 3 4 5 6 7 8 |
template <typename... Ts, typename T = common_type<Ts...>> inline auto sequence(const Ts&... list) { return lambda([seq = std::array<T, sizeof...(Ts)>{ list... }](size_t index) { return seq[index % seq.size()]; }); } |

## Conclusion

Modern standards of C++ are more than just a revision of the syntax or new functions and types.

They have many practical implementations, including a software where the high-performance is the main goal and can help with explicit optimization and vectorization in order to achieve the best speed results without code bloat.

With modern standards of C++ code can be fast and beautiful, scalable and maintable at the same time.

All these techniques are widely used in the KFR C++ DSP framework (GPL/Commercial) for template expressions.

Dan Levin, KFRLIB.COM