hmm, this musl-perf thing effectively makes musl LGPL, but they don't supply the source for the glibc string functions they borrowed.
one of the patches they apply changes the stdio default buffer size from 1024 bytes to 8192 bytes. why? who knows, no rationale is provided.
i guess the thinking is to make the buffer a whole number of 4096-byte pages, but why *two* pages?
outside of adding glibc string functions and the bufsize change, they add ifunc support to the dynamic linker.
ifuncs are awful because they make program execution inconsistent across different microarchitectures.
@ariadne Ifunc is just an utterly dumb way to do runtime microarch specific code selection.
@valpackett @dalias @ariadne Oooh, let me try to answer that! I think the "dumb" comes from the fact that there are other, more portable ways to implement "runtime selection of features" (e.g. function pointers stored in a protected page) that don't allow libraries to provide *arbitrary plugins* for the dynamic linker.
I wrote a whole mini-thesis on this if you are interested in getting INSANELY deep down in the weeds: https://github.com/robertdfrench/ifuncd-up
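For anyone following along, here is a minimal sketch of the function-pointer approach mentioned above. It is not taken from ifuncd-up: every name in it is hypothetical, the feature check is a stub, and the step of marking the pointer's page read-only after resolution (e.g. with mprotect) is left out, as is any thread-safety handling.

```c
#include <stdint.h>

/* All names here are hypothetical and only illustrate the idea. */
static uint64_t counter;

/* Two interchangeable implementations. In a real library the second
 * one would use instructions only some CPUs support. */
static void increment_generic(void) { counter += 1; }
static void increment_fancy(void)   { counter += 1; }

/* Assumption: a real program would query CPUID, getauxval(), or
 * similar here. This stub always reports "feature absent". */
static int cpu_has_fancy_feature(void) { return 0; }

static void increment_first_call(void);

/* The dispatch pointer starts out aimed at a one-shot resolver. */
static void (*increment_impl)(void) = increment_first_call;

/* Runs on the first call only: picks an implementation, stores it in
 * the pointer, then forwards the call so the first increment counts. */
static void increment_first_call(void)
{
    increment_impl = cpu_has_fancy_feature() ? increment_fancy
                                             : increment_generic;
    increment_impl();
}

/* Public entry points: one indirect call, no dynamic-linker hook. */
void increment_counter(void) { increment_impl(); }
uint64_t read_counter(void)  { return counter; }
```

The selection logic is ordinary application code that runs once, rather than a resolver the dynamic linker has to execute during relocation.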
@robertdfrench @valpackett @dalias @ariadne Your benchmarks don't seem to be testing the use case ifuncs purport to improve. You're basically just showing that there is overhead in routing function calls via the dynamic linker compared to making them directly, which is true but not particularly interesting.
For the use in glibc, they're already paying the PLT indirection cost. So the ifunc use lets them avoid a second indirection to pick the implementation.
A more useful benchmark would be to put your increment_counter() implementations in a shared library called by your benchmark harness.
@jamesh @valpackett @dalias @ariadne What do you think about this? https://github.com/robertdfrench/ifuncd-up/pull/21
Every different approach has its own libincrement that contains two different runtime-selectable increment implementations, so the cost now reflects making all of those available via the PLT.
This does not seem to change the result: ifunc does not outperform function pointers, nor does it meaningfully outperform the worst-case strategy of just checking the CPU features every single time.
@robertdfrench @jamesh @valpackett @ariadne What's been obvious to me for a long time is that, even if there were a performance advantage to ifunc, it could only be when the entire function call is so short that call overhead can be a significant portion of overall time.
On the other hand, use of a uarch-optimized variant for something like memcpy is only going to make any sense when the operation is above a certain size/time threshold.
@robertdfrench @jamesh @valpackett @ariadne This is why the proposed direction for further uarch-optimized string ops, etc. in #musl is not to have full asm functions selected at runtime, but to allow archs to provide uarch-optimized "bulk middle" operations that only get called for large operations, don't have any alignment/edge-case logic, and that get called from the generic C function only past a threshold where they could help (and where call cost is tiny %).
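A rough sketch of what the "bulk middle" shape described above could look like. This is not musl code: the function names, the 256-byte threshold, and the 64-byte block size are all assumptions made up for illustration, and the arch hook here is a plain C stand-in.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BULK_THRESHOLD 256 /* assumed cutoff, not a measured value */
#define BULK_BLOCK     64  /* assumed block size for the arch hook */

/* Hypothetical arch-provided hook. Contract: d is BULK_BLOCK-aligned
 * and n is a nonzero multiple of BULK_BLOCK, so it needs no alignment
 * or edge-case logic. This stand-in copies one block at a time; a real
 * one would use uarch-specific wide loads and stores. */
static void arch_bulk_copy(unsigned char *d, const unsigned char *s,
                           size_t n)
{
    for (size_t i = 0; i < n; i += BULK_BLOCK)
        memcpy(d + i, s + i, BULK_BLOCK);
}

/* Generic C function: owns all head/tail handling and only calls the
 * bulk hook past the threshold, where call cost is a tiny fraction. */
void *sketch_memcpy(void *restrict dest, const void *restrict src,
                    size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (n >= BULK_THRESHOLD) {
        /* Head: copy bytes until the destination is block-aligned. */
        size_t head = (size_t)(-(uintptr_t)d & (BULK_BLOCK - 1));
        n -= head;
        while (head--) *d++ = *s++;

        /* Middle: the largest whole number of blocks that fits. */
        size_t bulk = n & ~(size_t)(BULK_BLOCK - 1);
        arch_bulk_copy(d, s, bulk);
        d += bulk;
        s += bulk;
        n -= bulk;
    }

    /* Tail, or the entire copy when below the threshold. */
    while (n--) *d++ = *s++;
    return dest;
}
```

Small copies never reach the hook at all, so the fast path for short operations stays a plain inlined loop with no runtime selection involved.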