CPU cache -- old optimization skills won't work any more
This article is not explaining what CPU cache is, how CPU cache works. It's only to show you why some old optimization tricks won't work well.
The discussion here is focusing on native programming languages, such as C, C++. For non-native languages such as Java, .Net, the discussion may be not true.
Why are the dated optimization skills still in my (and maybe yours) mind?
- Because I developed in J2ME and Flash for many years. Those devices are far from modern CPU and the old skills worked quite well.
- Because my CPU, assembly language, optimization knowledge, stopped at 80386 era. :)
Dated skill 1, Use pre-calculated look up table and variables
Now let's see how to use look up table to calculate bits count in a 32 bits integer.
static const unsigned char BitsSetTable256[256] =
{
// pre-calculated bits count in an 8 bits integer
};
int calculateBitsCount(unsigned int n)
{
unsigned char * p = (unsigned char *)&n;
return BitsSetTable256[p[0]] +
BitsSetTable256[p[1]] +
BitsSetTable256[p[2]] +
BitsSetTable256[p[3]];
}
Really cool, yes? Only four "add" operations.
But with CPU cache, it's totally different. When function calculateBitsCount fetches BitsSetTable256 the first time, most likely the data cache needs to be empty and reload from BitsSetTable256, which thus will waste hundreds of CPU cycles. Hundreds of CPU cycles are more than enough to calculate the bits count without a look up table.
For example, the algorithm without look up table, from,
http://graphics.stanford.edu/~seander/bithacks.html
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; c++)
{
v &= v - 1; // clear the least significant bit set
}
Looks like it has more instructions and loops than look up table, but keep in mind that instruction is quite cheaper than data (unless there are too many instructions to fit in instruction cache). Indeed according to my test, the algorithm is faster than the look up table one.
Dated skill 2, Prefer local variables to object data member
Note that this skill is still efficient in most conditions, here is only explaining that sometimes it will be slower.
Assume we have a function,
void func(SomeObject * obj)
{
int i, k, p;
int count = obj->getCount();
for(i = 0; i < 100; ++i) {
for(int k = 0; k < 100; ++k) {
for(int p = 0; p < count; ++p) {
// prcess obj
}
}
}
}
And assume function getCount is only a simple getter function.
func looks quite good. And if all local variables are put in registers, it will work perfectly. But how about if "count" can't be put in register? Then in each loop "count" has to been fetched from the stack memory, and in the meantime "obj" is processed. Most likely "count" and "obj" can't be in one data cache, then in each loop data cache will be always reload, slow!
If we throw away the variable "count", and change the inner loop to,
for(int p = 0; p < obj->getCount(); ++p) {
// prcess obj
}
Even if getCount is not inlined by the compiler, since the data is in block same as "obj", if "obj" can fit in data cache, then no cache miss, it will be quite fast.
Dated skill 3, Do all things in one loop
In my old mind, I always believe loop is slow, especially loop has to jump between instructions. So I preferred below code,
ObjectA * objA;
ObjectB * objB;
for(int i = 0; i < 100; ++i) {
// do something on objA
// do some other thing on objA
// do something on objB
}
Considering about CPU cache, there are two problems with the code,
- In case the instructions in the loop body exceed instruction cache, the instruction cache has to be empty and reload in each loop, there is disaster of performance.
- And even worse, the loop body processes memory in two memory blocks (objA and objB), unless objA and objB are allocated conjointly, the data cache has to been refresh in each loop and cause cache miss.
If we split the loop,
ObjectA * objA;
ObjectB * objB;
for(int i = 0; i < 100; ++i) {
// do something on objA
}
for(int i = 0; i < 100; ++i) {
// do some other thing on objA
}
for(int i = 0; i < 100; ++i) {
// do something on objB
}
Then both instruction and data cache will be happy, and are willing to work faster.
Please note, though all above skills are labeled as "dated" by me, that doesn't mean they are wrong. They are still true under a lot of circumstances.
What we should know is that they are not always true on modern CPUs. Keep this in mind when optimization for performance.