Anteru's blog

C++ Tricks, #3: Beating your compiler at stupid stuff

May 02, 2007
  • Optimisation
  • Programming
approximately 13 minutes to read

Sometimes you hear people say: “Compilers are all crap. It’s really pathetic what they do when you want them to do XYZ.” Let’s take a look at how much truth is in that. Please note that this is just a fun example — don’t try this at home, kids: it took me around one hour to squeeze 20% more performance out of this function. That is absolutely not worth it; invest that hour into a better algorithm and it’ll give you much larger savings.

Rage against the machine

Let’s take this function, the dot product.

#include <cstddef> // size_t

double dot_c (const size_t size,
    double* __restrict a,
    double* __restrict b)
{
    double dot = 0;
    for (size_t i = 0; i < size; ++i)
    {
        dot += (a [i] * b [i]);
    }

    return dot;
}

Being this simple allows the compiler and the CPU to pull out all their tricks. The memory access pattern is so regular that even the dumbest prefetcher should understand it, and there is no branch besides the loop condition. It turns out that VC8 is even able to evaluate this function at compile time ;) (damn, that made it a hard target to beat, I can tell ya).
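To keep such a microbenchmark honest, the inputs have to come from somewhere the compiler cannot see. A minimal sketch (the `fill` helper and its `seed` parameter are my own illustration, not part of the original benchmark): deriving the data from a runtime value such as `argc` prevents the whole dot product from being folded into a constant.

```cpp
#include <cstddef>

double dot_c (const size_t size, const double* a, const double* b)
{
    double dot = 0;
    for (size_t i = 0; i < size; ++i)
        dot += a [i] * b [i];
    return dot;
}

// Fill the inputs from a runtime seed. When the seed comes from outside
// the translation unit (argc, a file, a volatile read), the compiler
// cannot evaluate the dot product at compile time.
void fill (const size_t size, double* a, double* b, const int seed)
{
    for (size_t i = 0; i < size; ++i) {
        a [i] = static_cast<double> (i + seed);
        b [i] = 1.0;
    }
}
```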

Handmade …

I’ll skip over my first hand-made version, which was 170% slower than the C version above, and go directly to this hand-tuned masterpiece ( :) ). Look how simple and elegant it is …

#include <emmintrin.h> // SSE2 intrinsics

double dot_sse2 (const size_t size,
    double* __restrict a,
    double* __restrict b)
{
    __m128d acc = _mm_setzero_pd (), res = _mm_setzero_pd ();
    __m128d a1l = _mm_setzero_pd (),
        b1l = _mm_setzero_pd (),
        a1h = _mm_setzero_pd (),
        b1h = _mm_setzero_pd (),
        r1 = _mm_setzero_pd (),
        r2 = _mm_setzero_pd ();

    for (size_t i = 0, eights = size / 8; i < eights; ++i) {
        a1l = _mm_loadl_pd (a1l, a+8*i), b1l = _mm_loadl_pd (b1l, b+8*i);
        a1h = _mm_loadl_pd (a1h, a+8*i+1), b1h = _mm_loadl_pd (b1h, b+8*i+1);

        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);

        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);

        a1l = _mm_loadl_pd (a1l, a+8*i+2), b1l = _mm_loadl_pd (b1l, b+8*i+2);
        a1h = _mm_loadl_pd (a1h, a+8*i+3), b1h = _mm_loadl_pd (b1h, b+8*i+3);

        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);

        a1l = _mm_loadl_pd (a1l, a+8*i+4), b1l = _mm_loadl_pd (b1l, b+8*i+4);
        a1h = _mm_loadl_pd (a1h, a+8*i+5), b1h = _mm_loadl_pd (b1h, b+8*i+5);

        acc = _mm_add_sd (r1, r2);
        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);

        a1l = _mm_loadl_pd (a1l, a+8*i+6), b1l = _mm_loadl_pd (b1l, b+8*i+6);
        a1h = _mm_loadl_pd (a1h, a+8*i+7), b1h = _mm_loadl_pd (b1h, b+8*i+7);

        acc = _mm_add_sd (r1, r2);
        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);

        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);
    }

    // Scalar tail: handle the remaining size % 8 elements.
    for (size_t i = size - size % 8; i < size; ++i) {
        res.m128d_f64 [0] += a [i] * b [i];
    }

    return res.m128d_f64 [0];
}

Basically, the trick is to interleave computation with the memory accesses and to shuffle the instructions around enough that the dependencies can be satisfied while the loads are still in flight. The function makes use of all 8 XMM registers.
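For comparison, here is a sketch of the same idea with packed arithmetic: `_mm_mul_pd`/`_mm_add_pd` process two doubles per instruction, and two independent accumulators break the dependency chain on the additions — the same trick the hand-scheduled version above plays with the scalar `_mm_mul_sd`/`_mm_add_sd`. The function name and the unaligned loads are my choices here, not from the hand-tuned version above.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>

double dot_sse2_packed (const size_t size, const double* a, const double* b)
{
    // Two independent accumulators so consecutive additions do not
    // have to wait on each other.
    __m128d acc0 = _mm_setzero_pd ();
    __m128d acc1 = _mm_setzero_pd ();

    const size_t quads = size / 4;
    for (size_t i = 0; i < quads; ++i) {
        // Unaligned loads so the sketch works for any input buffer.
        const __m128d a0 = _mm_loadu_pd (a + 4 * i);
        const __m128d b0 = _mm_loadu_pd (b + 4 * i);
        const __m128d a1 = _mm_loadu_pd (a + 4 * i + 2);
        const __m128d b1 = _mm_loadu_pd (b + 4 * i + 2);

        acc0 = _mm_add_pd (acc0, _mm_mul_pd (a0, b0));
        acc1 = _mm_add_pd (acc1, _mm_mul_pd (a1, b1));
    }

    // Horizontal sum of the two lanes.
    double lanes [2];
    _mm_storeu_pd (lanes, _mm_add_pd (acc0, acc1));
    double dot = lanes [0] + lanes [1];

    // Scalar tail for the remaining 0..3 elements.
    for (size_t i = quads * 4; i < size; ++i)
        dot += a [i] * b [i];

    return dot;
}
```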

Benchmark!

There is only one result that matters: the compiler’s C version runs in 1.25 sec at worst and 1.18 sec at best, my version in 1.0 sec at worst and 0.9 sec at best. Read: I’m up to 35% faster! Tested on Vista x86 on a 1.8 GHz Turion, with 2000 iterations over a 65536-element vector.
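The timing setup can be sketched roughly like this — the original harness is not shown in the post, so `benchmark` and the use of `std::chrono` are my assumptions. Accumulating and printing a checksum keeps the compiler from optimizing the 2000 calls away:

```cpp
#include <chrono>
#include <cstdio>
#include <cstddef>

// Times one dot-product implementation over many iterations, mirroring
// the setup in the text (2000 iterations over a 65536-element vector).
// The accumulated checksum is printed so the work cannot be removed as
// dead code. Returns the elapsed time in seconds.
template <typename Fn>
double benchmark (Fn dot, const size_t size, const int iterations,
    const double* a, const double* b)
{
    using clock = std::chrono::steady_clock;

    double checksum = 0;
    const auto start = clock::now ();
    for (int i = 0; i < iterations; ++i)
        checksum += dot (size, a, b);
    const auto stop = clock::now ();

    printf ("checksum: %f\n", checksum);
    return std::chrono::duration<double> (stop - start).count ();
}
```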

Conclusion

While it is usually fun to do some low-level optimization, and it can even give some nice results, it’s usually not worth the hassle. In 99.9% of real-world cases the compiler will create much better code than you can ever hope to write. Unless a profiler tells you that a particular function is a huge bottleneck, don’t even think about doing such low-level tuning — and even then, look for a better algorithm first.
