Cache Misses: Bad

So that's not a particularly controversial statement, but it really took on new meaning this afternoon. While plowing through Granny in the Xbox 360 profiler, I noticed something rather improbable. A function that did essentially no computation, just a loop and a few conditional dispatches, was taking in excess of 10% of the time in one of our critical sampling paths. Odd. Mispredicted branches are bad, but that bad? So, after setting up a decent test case and stuffing it through the profiler, one number jumped out. In 100 calls to the function, it was missing the L2 data cache 22000 times.

22 f***ing thousand cache misses!!!

It's really hard to overstate just how bad that is. Granny is by its nature somewhat non-local in its memory accesses, but that's simply ridiculous. 200+ cache misses per call? A handy rule of thumb for the time taken by L2$ misses on the X360 is "5000 misses ~= 1 millisecond". That's pretty close on the PS3 as well, assuming that the SPUs aren't banging on the memory bus at the same time. So the function was wasting around 4.5 milliseconds per frame just waiting around for memory to be delivered.
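For the record, here is the back-of-the-envelope arithmetic as a minimal C sketch. The helper name `l2_miss_cost_ms` is made up for illustration, and the per-frame figure assumes the 100 profiled calls all land in a single frame, which the numbers above imply but don't state outright.

```c
/* Rule of thumb from the text: roughly 5000 L2 misses cost about 1 ms
 * on the Xbox 360. This is an estimate, not a measured constant. */
#define L2_MISSES_PER_MS 5000.0

/* Hypothetical helper: estimated time (in milliseconds) spent stalled
 * on L2 misses, given a miss count. */
double l2_miss_cost_ms(double misses)
{
    return misses / L2_MISSES_PER_MS;
}
```

Feeding in the 22000 misses from the profiler run gives about 4.4 ms, which is where the "around 4.5 milliseconds" figure comes from.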

The good news is that the problem was really easy to fix. Throw in a few prefetches, and the cache misses dropped to essentially zero. (~600 per 100 calls, which is totally acceptable.) Throw a few more around, and the application which was missing L2$ 44k times per frame now drops to 12k. That's a savings of 32k L2$ misses ~= 6 milliseconds. Per frame! 12k cache lines is roughly 1.5 megs on the 360, which is close to the working set size of this particular stress test. Much better. Five lines of prefetches sped up the test app by 20%.
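To give a flavor of what "a few prefetches" looks like, here is a rough sketch of the general technique, not the actual Granny code. Everything in it is an assumption for illustration: the `Node` type, the `sum_nodes` loop, and the prefetch distance of 4. It uses the GCC/Clang `__builtin_prefetch` builtin as a portable stand-in; on the 360 you would reach for the platform's own cache-touch intrinsic instead.

```c
#include <stddef.h>

/* Hypothetical node type standing in for scattered, non-local data. */
typedef struct Node {
    float value;
} Node;

/* How far ahead to prefetch. A tuning knob to be found by profiling,
 * not a measured value. */
enum { PREFETCH_DISTANCE = 4 };

/* Portable stand-in for a platform prefetch instruction. On compilers
 * without the builtin this compiles to nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Walk an array of pointers to scattered nodes. While working on
 * element i, ask the cache to start fetching element i + distance, so
 * the memory latency overlaps with useful work instead of stalling. */
float sum_nodes(Node *const *nodes, size_t count)
{
    float total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        if (i + PREFETCH_DISTANCE < count)
            PREFETCH(nodes[i + PREFETCH_DISTANCE]);
        total += nodes[i]->value;
    }
    return total;
}
```

The prefetch is only a hint, so the result is identical with or without it; the only way to know whether the distance is right is to put it back through the profiler.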

The whole episode points to part of the fun (and the pain) of developing for multiple platforms at the present moment in the gaming technology curve. If you don't keep a close eye on the strengths and weaknesses of each platform, you are absolutely doomed. Those cache misses never showed up as a problem on the x86 platform where I do most Granny development, since the out-of-order core hid most of the memory latency. The current generation of Intel and AMD chips also has ninja predictive prefetch hardware, which Microsoft chose not to include in the PPC-based Xenon cores. The PS3 does have a basic predictive prefetcher, but it's not nearly as intelligent as a modern out-of-order core.

Now I just have to figure out how to do something sensible with the SPUs on the PS3...

Unforced Error

My god, the compiler tools included with Visual Studio 2005 are a flaming bag of poo. How is it possible that a business built on programming can deliver such crappy tools? And the bit that's really getting under my skin right now is technology that was old and boring in 1970: command-line compilers and linkers. Witness:

Exhibit A. Manifests. Holy mother of god, who designed and implemented this "feature"? Here's the synopsis: each executable must have an XML resource associated with it that specifies (in part) which libraries it was linked against. So you can't distribute your application without dealing with this crud. The first sign that this is not going to be a pleasant experience is that the resource must have an ID of 1, unless you're compiling a DLL, in which case it must have an ID of 2. Good work, guys. The .manifest file is spit out by the linker, which makes some sense, but the linker doesn't contain a switch to simply stuff the damn thing into the linked file! The simple way to deal with this is to call the new manifest tool with command lines that look like:

mt.exe -manifest some.exe.manifest -outputresource:some.exe;#1

Yeah, that's hot! ";#1"? Who the hell makes command lines that look like that? The second way to handle this is to wait for the linker to spit out the .exe.manifest file, create a custom .res file that contains just that manifest as a resource, compile it with the resource compiler, and then link the whole thing again! Awesome!

Look, I get it. DLL hell is no fun. The install program from Gramma's recipe database tries to install the C runtime DLL from 1982 into system32. What I don't understand is why I'm being punished because Microsoft managed to screw up their access controls in the 90s. Guys: if you want to do this, add a linker switch that just does the f'ing right thing! It's not hard. The steps for a normal application are entirely deterministic. You can maintain the above lovely pieces of tech for people who are doing "innovative" things with your environment. By the way, this page is the most informative bit of documentation that I could find on this topic at MSDN. I don't know whether to laugh or cry. NMAKE is not a real build program, just document the damn switches, OK?

Exhibit B. Smarmy deprecation warnings for the C standard library functions. Attention: just because you have problems writing more than 10 lines of code without creating a remote-root buffer overflow exploit doesn't mean that I do. Why do I have to define _CRT_SECURE_NO_DEPRECATE to turn these things off? Shouldn't I have to define _HOLD_MY_HAND_PLS_THX to turn them on? You don't get to deprecate the standard library!
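For anyone hitting the same warnings, the workaround looks roughly like this. The macro has to be visible before the first CRT header is pulled in (or be passed on the compiler command line). The `copy_name` helper is a made-up example, not code from any real project; other compilers simply ignore the define.

```c
/* Must appear before any CRT header is included, otherwise the
 * deprecation warnings have already been armed. */
#define _CRT_SECURE_NO_DEPRECATE
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: plain old strncpy, used carefully. Copies at
 * most dst_size - 1 characters and always terminates the buffer. */
void copy_name(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return;
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

Defining it once in the project-wide compiler flags is less noisy than sprinkling it through source files.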

Exhibit C. No inline assembly for x64 targets? I know that we're supposed to pretend that the CPU doesn't exist now and code to the CLR, but hey, that doesn't work. Let me access RDTSC, please. They even removed the _emit keyword, so you can't work around the problem without building a full .asm file for MASM, and incurring function call overhead. Lovely.
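One partial escape hatch worth checking is compiler intrinsics. As far as I can tell, `<intrin.h>` exposes `__rdtsc()` on MSVC for both x86 and x64 targets, and GCC and Clang offer the same name through `<x86intrin.h>`; verify that against your own toolchain before relying on it. A minimal sketch follows, where `read_cycle_counter` is a made-up wrapper name and the `clock()` fallback is only there so the code builds on non-x86 machines.

```c
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>      /* __rdtsc on MSVC */
#define HAVE_RDTSC 1
#elif defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc on GCC and Clang */
#define HAVE_RDTSC 1
#else
#include <time.h>        /* no TSC here; fall back to clock() */
#define HAVE_RDTSC 0
#endif

/* Hypothetical wrapper: returns the CPU timestamp counter where the
 * intrinsic is available, and a coarse clock() tick count otherwise. */
uint64_t read_cycle_counter(void)
{
#if HAVE_RDTSC
    return (uint64_t)__rdtsc();
#else
    return (uint64_t)clock();
#endif
}
```

Since the compiler expands the intrinsic inline, this sidesteps the separate .asm file and its function call overhead, at least for the instructions that have an intrinsic.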

Exhibit D. When I start the IDE for VC, the start page goes and loads a feed for important developer news. Good to know that time is spent on the things that matter.

Update: I just gave up on the DLL CRT and linked statically. That's probably the way to go anyway; who the heck can predict what they'll download to a user's machine from Windows Update?