Cache Misses: Bad
So that's not a particularly controversial statement, but it really took on new meaning this afternoon. While plowing through Granny in the Xbox 360 profiler, I noticed something rather improbable. A function that did essentially no computation, just a loop and a few conditional dispatches, was taking in excess of 10% of the time in one of our critical sampling paths. Odd. Mispredicted branches are bad, but that bad? So, after setting up a decent test case and stuffing it through the profiler, one number jumped out. In 100 calls to the function, it was missing the L2 data cache 22000 times.
22 f***ing thousand cache misses!!!
It's really hard to overstate just how bad that is. Granny is by its nature somewhat non-local in its memory accesses, but that's simply ridiculous. 200+ cache misses per call? A handy rule of thumb for the time taken by L2$ misses on the X360 is "5000 misses ~= 1 millisecond". That's pretty close on the PS3 as well, assuming that the SPUs aren't banging on the memory bus at the same time. So the function was wasting around 4.5 milliseconds per frame just waiting around for memory to be delivered.
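To make the arithmetic explicit, here is a minimal sketch of that rule of thumb in C. The constant is just the "5000 misses ~= 1 millisecond" estimate from above, and the function name is made up for illustration. It's a back-of-the-envelope number, not a measured latency.

```c
#include <assert.h>

/* Rule of thumb from the text: ~5000 L2 misses cost ~1 ms on the X360.
   This is a rough estimate, not a hardware specification. */
#define L2_MISSES_PER_MS 5000.0

/* Hypothetical helper: estimated stall time, in milliseconds,
   for a given number of L2 data cache misses. */
static double l2_miss_cost_ms(double misses)
{
    assert(misses >= 0.0);
    return misses / L2_MISSES_PER_MS;
}
```

Plugging in the 22000 misses from the profiler run gives 22000 / 5000 = 4.4 ms, which is where the "around 4.5 milliseconds" figure comes from.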
The good news is that the problem was really easy to fix. Throw in a few prefetches, and the cache misses dropped to essentially zero. (~600 per 100 calls, which is totally acceptable.) Throw a few more around, and the application which was missing L2$ 44k times per frame now drops to 12k. That's a savings of 32k L2$ misses ~= 6 milliseconds. Per frame! 12k cache lines is roughly 1.5 megs on the 360, which is close to the working set size of this particular stress test. Much better. Five lines of prefetches sped up the test app by 20%.
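For anyone who hasn't done this before, the general shape of the fix looks something like the sketch below. This is not the actual Granny code. The `track_t` struct, `sum_tracks`, and the prefetch distance of 4 are all invented for illustration, and `PREFETCH` is a stand-in that maps to GCC/Clang's `__builtin_prefetch`; on the 360 you would use the platform's own prefetch intrinsic (the PowerPC `dcbt` instruction) instead. The idea is simply to ask for the data a few iterations before the loop touches it.

```c
#include <assert.h>
#include <stddef.h>

/* Portable stand-in for a platform prefetch intrinsic (an assumption:
   the real code would use whatever the console compiler provides). */
#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH(p) __builtin_prefetch((p))
#else
#  define PREFETCH(p) ((void)(p))
#endif

/* Hypothetical record reached through a pointer, so each visit
   is a potential cache miss. */
typedef struct {
    int   kind;
    float value;
} track_t;

/* How many iterations ahead to prefetch. Illustrative only;
   the right distance has to be found with a profiler. */
enum { PREFETCH_DISTANCE = 4 };

/* A loop with a few conditional dispatches and almost no math. */
static float sum_tracks(track_t *const *tracks, size_t count)
{
    float total = 0.0f;
    assert(tracks != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        /* Request a later element now so it is (hopefully) in cache
           by the time the loop reaches it. */
        if (i + PREFETCH_DISTANCE < count)
            PREFETCH(tracks[i + PREFETCH_DISTANCE]);

        const track_t *t = tracks[i];
        switch (t->kind) {
        case 0:  total += t->value; break;
        case 1:  total -= t->value; break;
        default: break;              /* unknown kinds are skipped */
        }
    }
    return total;
}
```

A prefetch is only a hint, so the results are identical with or without it. The only thing that changes is how long the loop sits around waiting for memory.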
The whole episode points to part of the fun (and the pain) of developing for multiple platforms at the present moment in the gaming technology curve. If you don't keep a close eye on the strengths and weaknesses of each platform, you are absolutely doomed. Those cache misses never showed up as a problem on the x86 platform where I do most Granny development, since the out-of-order core hid most of the memory latency. The current generation of Intel and AMD chips also has ninja predictive prefetch hardware, which Microsoft chose not to include in the PPC-based Xenon cores. The PS3 does have a basic predictive prefetcher, but it's not nearly as intelligent as a modern out-of-order core.
Now I just have to figure out how to do something sensible with the SPUs on the PS3...