Binary size can have an indirect effect on performance due to cache usage.

I wrote this bit of code to investigate whether compilers fold identical template code (in this case, the sorting of pointers).
```cpp
#include <cstdio>
#include <cstdlib>
#include <algorithm>

#define CAST 1

template<typename T>
void f() {
    std::size_t n = 1024;
    T** v = (T**)std::malloc(n * sizeof(T*));
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = (T*)std::malloc(sizeof(T));
        *v[i] = T{};
    }
#if CAST
    std::sort((void**)v, (void**)(v + n));
#else
    std::sort(v, v + n);
#endif
    // %p expects void*, so cast explicitly.
    std::printf("%p, %p\n", (void*)v[0], (void*)v[n - 1]);
    for (std::size_t i = 0; i < n; ++i) {
        std::free(v[i]);
    }
    std::free(v);
}

int main() {
    std::printf("CAST = %d\n", CAST);
    f<double>();
    f<float>();
    f<int>();
    f<short>();
    f<char>();
    f<long>();
    f<unsigned>();
    f<unsigned long>();
}
```
I compile it with `-O3` and then `strip` the binary. Interestingly, the generated binary size does depend on whether the pointers are cast to `void*` before being sorted. For Clang++ especially, the binary is much smaller with the cast:
| CAST | G++ 12 | Clang++ 15 |
|---|---|---|
| false | 22 KB | 35 KB |
| true | 18 KB | 19 KB |
This is odd, because one would think that if `sort` is inlined, the generated machine code should not depend on `CAST`. But if it's not inlined, then the compiler should notice that a bunch of `sort<>` instantiations have identical machine code and fold them into one. But apparently, this does not happen. I wonder why?