Programming the e500/e200 SPE

The e500v2 and e200z6 are Freescale PowerPC cores found in systems-on-chip such as the MPC8536E, P2020 and MPC55xx. These cores feature a Signal Processing Engine Auxiliary Processing Unit (SPE-APU).

The SPE is a 64-bit wide SIMD unit with an accumulator register. It typically outperforms generic C code on workloads dominated by multiply-accumulate operations, such as FIR filters or FFTs.
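To make "multiply-accumulate" concrete, here is a minimal sketch in plain, portable C of the kind of loop meant. It is not SPE code and not taken from the application discussed here; the name fir_sample and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: computes one FIR output sample as a chain of
 * multiply-accumulate steps. coeff[k] is multiplied with window[k] and
 * the products are summed into a single accumulator. */
int64_t fir_sample(const int16_t *coeff, const int16_t *window, size_t taps)
{
    int64_t acc = 0;   /* wide accumulator so the sum cannot overflow */
    size_t k;

    assert(coeff != NULL && window != NULL);

    for (k = 0; k < taps; k++) {
        /* one multiply-accumulate per tap */
        acc += (int32_t)coeff[k] * (int32_t)window[k];
    }
    return acc;
}
```

Every iteration is an independent multiply followed by an add into the same accumulator, which is the pattern a SIMD unit with an accumulator register is built for.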

For a customer's application, I was interested in optimizing the averaging of trains of 16-bit signed integer values, using GCC 4.3.3.
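As a reference point, a plain C version of such an averaging routine might look like the sketch below. This is an assumption about the general shape of the baseline, not the actual customer code; the name average_s16 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical baseline: average a train of signed 16-bit samples.
 * The result is truncated toward zero by the integer division. */
int16_t average_s16(const int16_t *samples, size_t count)
{
    int64_t sum = 0;   /* wide enough for any realistic train length */
    size_t i;

    assert(samples != NULL && count > 0);

    for (i = 0; i < count; i++) {
        sum += samples[i];
    }
    return (int16_t)(sum / (int64_t)count);
}
```

A loop like this handles one sample per iteration, which leaves room for both the compiler and a SIMD unit to do better.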

The results, expressed as speed-up factors:

  • 2.09 gain by using the SPE-APU intrinsics.
  • 1.37 gain by compiling with -funroll-all-loops -funroll-loops.
  • 1.37 gain by aligning the sample trains to the cache line size.
  • 1.22 gain by manually unrolling the averaging loop by a factor of two.
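The last two items can be sketched in portable C as follows. This is only an illustration under stated assumptions: a 32-byte cache line (check the reference manual of your core), the GCC aligned attribute for placing the buffer, and hypothetical names (CACHE_LINE_BYTES, TRAIN_LEN, sample_train, average_s16_unrolled). It does not use the SPE intrinsics and is not the code that was measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32   /* assumption: cache line size of the target */
#define TRAIN_LEN        64   /* hypothetical train length */

/* Sample train placed on a cache line boundary (GCC attribute syntax). */
int16_t sample_train[TRAIN_LEN] __attribute__((aligned(CACHE_LINE_BYTES)));

/* Returns 1 if p starts on a cache line boundary, 0 otherwise. */
int is_cache_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_BYTES) == 0;
}

/* Averaging loop manually unrolled by a factor of two. The two
 * accumulators are independent, so the two additions in one iteration
 * do not have to wait for each other. */
int16_t average_s16_unrolled(const int16_t *samples, size_t count)
{
    int64_t sum_even = 0;
    int64_t sum_odd = 0;
    size_t i;

    assert(samples != NULL && count > 0);

    for (i = 0; i + 1 < count; i += 2) {
        sum_even += samples[i];
        sum_odd  += samples[i + 1];
    }
    if (i < count) {
        sum_even += samples[i];   /* leftover sample when count is odd */
    }
    return (int16_t)((sum_even + sum_odd) / (int64_t)count);
}
```

Keeping two separate accumulators also mirrors how a two-lane SIMD unit would sum even and odd samples side by side before a final combine.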

All in all, the optimized version uses 22% of the CPU time of the plain C implementation compiled with similar optimization flags.

More optimization might be possible, but this analysis gave enough information to proceed with architectural choices during this phase of the project.

Although I discourage early optimization, I do promote early design space exploration: mapping each function to the unit that executes it most efficiently.

Thanks to Khem Raj for his help and for some of the cache and compiler optimization suggestions.