I’ve been working on optimizing a piece of code for the ARM Cortex-A53, and I thought I’d share my journey and some insights. The code in question computes the magnitude (absolute value) of a vector of complex floats, implemented with NEON intrinsics. Here’s a quick overview of what I did and what I learned.
Initially, the code used a straightforward loop to process the vector elements. It worked, but I suspected there was room for improvement. So I decided to try loop unrolling—a technique where you manually expand the loop body to reduce loop overhead and expose more independent work per iteration.
Here’s the original code snippet:
```c
#include <arm_neon.h>
#include <stdint.h>

/* Assumed layout; the original post doesn't show the definition. */
typedef struct { float re, im; } ComplexFloat;

/* Computes |z| = sqrt(re^2 + im^2) for N complex floats.
 * N is assumed to be a multiple of 4; leftover elements are not handled. */
void Abs(ComplexFloat *pIn, float *pOut, uint32_t N) {
    float *pDst = pOut;
    float32x4_t Res;
    float32x4x2_t Vec;
    ComplexFloat *pSrc = pIn;

    for (uint32_t n = 0; n < N >> 2; n++) {
        Vec = vld2q_f32((float *)pSrc);  /* deinterleave 4 re/im pairs */
        Res = vdupq_n_f32(0);
        Res = vmlaq_f32(Res, Vec.val[0], Vec.val[0]);  /* re^2 */
        Res = vmlaq_f32(Res, Vec.val[1], Vec.val[1]);  /* + im^2 */
        Res = vsqrtq_f32(Res);
        vst1q_f32(pDst, Res);
        pDst += 4;
        pSrc += 4;
    }
}
```
The goal was to process more elements per iteration, cutting the number of loop iterations and the associated overhead. After some experimentation, I unrolled the loop to process four vectors (16 complex elements) per iteration. Here’s the unrolled version:
```c
void Abs(ComplexFloat *pIn, float *pOut, uint32_t N) {
    float *pDst = pOut;
    float32x4_t Res0, Res1, Res2, Res3;
    float32x4x2_t Vec0, Vec1, Vec2, Vec3;
    ComplexFloat *pSrc = pIn;

    /* 4x unrolled: 16 complex elements per iteration.
     * N is assumed to be a multiple of 16. */
    for (uint32_t n = 0; n < N >> 4; n++) {
        Vec0 = vld2q_f32((float *)pSrc);
        Res0 = vmulq_f32(Vec0.val[0], Vec0.val[0]);        /* re^2 */
        Res0 = vmlaq_f32(Res0, Vec0.val[1], Vec0.val[1]);  /* + im^2 */
        Res0 = vsqrtq_f32(Res0);
        pSrc += 4;

        Vec1 = vld2q_f32((float *)pSrc);
        Res1 = vmulq_f32(Vec1.val[0], Vec1.val[0]);
        Res1 = vmlaq_f32(Res1, Vec1.val[1], Vec1.val[1]);
        Res1 = vsqrtq_f32(Res1);
        pSrc += 4;

        Vec2 = vld2q_f32((float *)pSrc);
        Res2 = vmulq_f32(Vec2.val[0], Vec2.val[0]);
        Res2 = vmlaq_f32(Res2, Vec2.val[1], Vec2.val[1]);
        Res2 = vsqrtq_f32(Res2);
        pSrc += 4;

        Vec3 = vld2q_f32((float *)pSrc);
        Res3 = vmulq_f32(Vec3.val[0], Vec3.val[0]);
        Res3 = vmlaq_f32(Res3, Vec3.val[1], Vec3.val[1]);
        Res3 = vsqrtq_f32(Res3);
        pSrc += 4;

        vst1q_f32(pDst, Res0);
        pDst += 4;
        vst1q_f32(pDst, Res1);
        pDst += 4;
        vst1q_f32(pDst, Res2);
        pDst += 4;
        vst1q_f32(pDst, Res3);
        pDst += 4;
    }
}
```
The results were pretty encouraging! The unrolled code ran about 15% faster than the original version. The improvement likely comes from reduced loop overhead and, on an in-order core like the Cortex-A53, from the four independent dependency chains, which let the loads and the long-latency square roots overlap instead of stalling the pipeline.
I also experimented with reordering the load, calculation, and store operations to see if it could further improve performance. However, the initial unrolling seemed to be the most effective change. It’s a reminder that sometimes, the simplest optimizations can yield significant results.
If you’re working on similar optimizations or have tips for further improving performance, I’d love to hear your thoughts! Happy coding!