Programming the e500/e200 SPE
The e500v2 and e200z6 are Freescale PowerPC cores found in systems-on-chip such as the MPC8536E, P2020 and MPC55xx. These cores feature a Signal Processing Engine Auxiliary Processing Unit (SPE-APU).
This is a 64-bit wide SIMD unit with an accumulator register. It will typically outperform generic C code on workloads with many multiply-accumulate operations, such as FIR filters or FFTs.
For my customer's application, I was interested in optimizing the averaging of trains of 16-bit signed integer values, using GCC 4.3.3.
The results (each number is a speed-up factor):
- 2.09 gain by using the SPE-APU intrinsics.
- 1.37 gain by -funroll-all-loops -funroll-loops.
- 1.37 gain by aligning the sample trains to the cache line size.
- 1.22 gain by manually unrolling the averaging loop by a factor of two.
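The last two items can be illustrated in portable C. This is only a minimal sketch, not the code that was measured: it leaves out the SPE-APU intrinsics entirely, the function and buffer names are made up for illustration, and the 32-byte cache line is an assumption to check against the reference manual of your core. It also assumes the train is short enough (at most 65536 samples) for the sum to fit in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed L1 cache line size in bytes; verify for your core. */
#define CACHE_LINE 32

/* Baseline: one accumulator, one sample per iteration. */
int16_t avg_plain(const int16_t *s, size_t n)
{
    int32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += s[i];
    return n ? (int16_t)(sum / (int32_t)n) : 0;
}

/* Manually unrolled by two, with two independent accumulators so the
 * two additions in the loop body do not depend on each other. */
int16_t avg_unrolled2(const int16_t *s, size_t n)
{
    int32_t sum0 = 0, sum1 = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        sum0 += s[i];
        sum1 += s[i + 1];
    }
    if (i < n)              /* leftover sample when n is odd */
        sum0 += s[i];
    return n ? (int16_t)((sum0 + sum1) / (int32_t)n) : 0;
}

/* Hypothetical sample train, aligned to the assumed cache line size
 * using the GCC-specific aligned attribute. */
static int16_t samples[256] __attribute__((aligned(CACHE_LINE)));
```

Both functions return the same result for the same input. Whether the unrolled one is actually faster depends on the compiler flags and the core, so it should be measured rather than assumed.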
All in all, the optimized version uses 22% of the CPU time of the plain C implementation compiled with similar optimization flags, a speed-up of roughly 4.5 times.
More optimization might be possible, but this analysis gave enough information to proceed with architectural choices during this phase of the project.
Although I discourage early optimization, I do promote early design space exploration: mapping functions to the most efficient units.
Thanks to Khem Raj for some of the cache and compiler optimization suggestions and help.