hmm, this musl-perf thing effectively makes musl LGPL, but they don't supply the source for the glibc string functions they borrowed.
one of the patches they apply changes the stdio default buffer size from 1024 bytes to 8192 bytes. why? who knows, no rationale is provided.
i guess the thinking is to align on a page boundary, but why *two* pages?
outside of adding glibc string functions and the bufsize change, they add ifunc support to the linker.
ifuncs are awful because they make program execution inconsistent across different microarchitectures
@ariadne Ifunc is just an utterly dumb way to do runtime microarch specific code selection.
@dalias yeah, i agree. it would be nice to have some of those AVX string implementations though in musl.
@ariadne Possibly. We have a tentative roadmap for a reasonable way to do that involving nothing like ifunk.
@valpackett @dalias @ariadne Oooh, let me try to answer that! I think the "dumb" comes from the fact that there are other, more portable ways to implement "runtime selection of features" (e.g. function pointers stored in a protected page) that don't allow libraries to provide *arbitrary plugins* for the dynamic linker.
I wrote a whole mini-thesis on this if you are interested in getting INSANELY deep down in the weeds: https://github.com/robertdfrench/ifuncd-up
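For readers who want the function-pointer alternative spelled out, here is a minimal sketch. It is not taken from the linked repo: `sum_u32`, `sum_init`, `sum_baseline` and `sum_wide` are hypothetical names, and `sum_wide` is only a stand-in for a tuned routine. It assumes GCC or Clang on x86 for `__builtin_cpu_supports`, and falls back to the baseline everywhere else. Making the pointer's page read-only after init (the "protected page" part) would need something like `mprotect` and is left out.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*sum_fn)(const uint32_t *, size_t);

/* Baseline implementation: safe on every CPU. */
static uint64_t sum_baseline(const uint32_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Stand-in for a uarch-tuned variant (hypothetical): two accumulators. */
static uint64_t sum_wide(const uint32_t *v, size_t n)
{
    uint64_t a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    if (i < n)
        a += v[i];
    return a + b;
}

/* The selected implementation; defaults to the baseline until init runs. */
static sum_fn sum_impl = sum_baseline;

/* Pick an implementation once, in ordinary library code, not in the linker. */
void sum_init(void)
{
    int use_wide = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    use_wide = __builtin_cpu_supports("avx2");
#endif
    sum_impl = use_wide ? sum_wide : sum_baseline;
}

/* Public entry point: one indirect call through the stored pointer. */
uint64_t sum_u32(const uint32_t *v, size_t n)
{
    return sum_impl(v, n);
}
```

Callers only ever see `sum_u32`; the selection is plain data plus one `if`, so nothing here runs arbitrary code during symbol resolution.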
@robertdfrench @valpackett @dalias @ariadne
x86 microarchitecture levels [ https://www.phoronix.com/news/GCC-11-x86-64-Feature-Levels ] would probably address 90% of IFUNC uses, I wish we could get that rolling more... (And maybe have an ARM64 equivalent?)
But distro & package support seems spotty on this for now :'(
@robertdfrench @valpackett @dalias @ariadne (to be clear this means shipping 5 copies of binaries, i.e. which code actually runs still differs from machine to machine, and that is a new factor in bugs, but at least you can quite easily tell what you ended up with. And switching to another variant is just moving files around.)
@equinox @valpackett @dalias @ariadne This is a great solution for systems that are installed and operated on the same hardware, but VM & Container images have to boot without guidance from the package manager (until SystemD grows its own package manager, which it should!)
@robertdfrench @valpackett @dalias @ariadne that's not how this works, all 5 binaries are part of the same package; the dynamic linker chooses which one to load at program start. They're in different subdirectories under /lib. (...needs work for /bin...)
Of course the package is then 5x in size for binaries, which depending on your use case can be anywhere from irrelevant to a dealbreaker.
@equinox @valpackett @dalias @ariadne oh you want the linker making the choice? Yeah, I could get behind that. You could go even further and mark symbols in the same binary as being variants for each micro-architecture, and then let the linker assemble it based on its own feature detection decisions. Like if ifunc were a table rather than ARBITRARY CODE.
I endorse this solution wholeheartedly.
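A rough sketch of what "a table rather than arbitrary code" could look like. Everything here is hypothetical: no existing linker consumes such a table, and the `FEAT_*` bits, `struct variant` and `resolve_increment` are made-up names for illustration. The point is that the library only supplies data, and whoever does the resolving walks it with its own feature mask.

```c
#include <stddef.h>

/* Hypothetical feature bits the resolver would fill in from its own detection. */
enum {
    FEAT_SSE42 = 1u << 0,
    FEAT_AVX2  = 1u << 1
};

/* One table row: features this variant requires, and the code to run. */
struct variant {
    unsigned required;
    int (*fn)(int);
};

static int increment_baseline(int x) { return x + 1; }
static int increment_avx2(int x)     { return x + 1; } /* stand-in for a tuned body */

/* Ordered from most to least demanding; the last row must require nothing. */
static const struct variant increment_variants[] = {
    { FEAT_AVX2, increment_avx2 },
    { 0,         increment_baseline },
};

/* The resolver's job reduces to a table walk: first row whose needs are met. */
const struct variant *resolve_increment(unsigned cpu_features)
{
    size_t count = sizeof increment_variants / sizeof increment_variants[0];
    for (size_t i = 0; i < count; i++) {
        unsigned need = increment_variants[i].required;
        if ((cpu_features & need) == need)
            return &increment_variants[i];
    }
    return NULL; /* unreachable while a zero-requirement row exists */
}
```

Because the table is inert data, a tool could also print which row would be picked for a given feature mask without executing anything from the library.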
@robertdfrench @equinox @valpackett @ariadne The linker doesn't even need to make the choice. The system can just be configured to symlink the right variants into a tmpfs, bind mount them over the default baseline-portable ones, or add a directory to the library search path file, as appropriate for the running hardware.
This is why #musl does not (and won't) have uarch-optimization-resolving logic in ldso. It's easily factored to a better policy layer.
@dalias @robertdfrench @valpackett @ariadne this conveniently also works for /bin, it's just... "less obvious"... where to put the uarch subdirs. Not that it needs a huge standard or anything.
Really just a question of build and packaging.
P.S.: I'm really keen on this because I have good reasons to want POPCNT, which is only missing on the very oldest x86_64 CPUs :'(
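To make the POPCNT point concrete, a small sketch (the function names `popcount64` and `popcount64_portable` are made up for illustration). With GCC or Clang, `__builtin_popcountll` can compile down to the hardware instruction only when the build target allows it, e.g. `-mpopcnt` or `-march=x86-64-v2`; for a baseline x86_64 build the compiler has to emit a slower generic sequence or helper call instead, which is exactly what a per-level build avoids.

```c
#include <stdint.h>

/* Portable fallback: clears the lowest set bit once per iteration. */
unsigned popcount64_portable(uint64_t x)
{
    unsigned count = 0;
    while (x) {
        x &= x - 1;
        count++;
    }
    return count;
}

/* Uses the compiler builtin when available; what it compiles to depends
 * on the target flags chosen at build time, not on the running CPU. */
unsigned popcount64(uint64_t x)
{
#if defined(__GNUC__)
    return (unsigned)__builtin_popcountll(x);
#else
    return popcount64_portable(x);
#endif
}
```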
@dalias @equinox @valpackett @ariadne yeah okay, I'll allow it. That approach would give a lot more administrative visibility anyways, since you could just run `mount` instead of having to query the linker for its decisions.
However... we do already have the expectation that you can query the linker for how it would resolve dependencies. So if it can't give you the "whole" picture, that might confuse folks.
@robertdfrench @valpackett @dalias @ariadne Your benchmarks don't seem to be testing the use case ifuncs purport to improve. You're basically just showing that there is overhead in routing function calls via the dynamic linker compared to doing them direct, which is true but not particularly interesting.
For the use in glibc, they're already paying the PLT indirection cost. So the ifunc use lets them avoid a second indirection to pick the implementation.
A more useful benchmark would be to put your increment_counter() implementations in a shared library called by your benchmark harness.
@jamesh @valpackett @dalias @ariadne yeah I have been feeling a little unsure about those tests for a while. Let me take another crack at them and see what comes out.
@jamesh @valpackett @dalias @ariadne What do you think about this? https://github.com/robertdfrench/ifuncd-up/pull/21
Every different approach has its own libincrement that contains two different runtime-selectable increment implementations, so the cost now reflects making all of those available via the PLT.
This does not seem to change the fact that ifunc does not outperform function pointers, nor does it meaningfully outperform the worst case strategy of just checking the CPU features every single time.
@robertdfrench @jamesh @valpackett @ariadne What's been obvious to me for a long time is that, even if there were a performance advantage to ifunc, it could only be when the entire function call is so short that call overhead can be a significant portion of overall time.
On the other hand, use of a uarch-optimized variant for something like memcpy is only going to make any sense when the operation is above a certain size/time threshold.
@robertdfrench @jamesh @valpackett @ariadne This is why the proposed direction for further uarch-optimized string ops, etc. in #musl is not to have full asm functions selected at runtime, but to allow archs to provide uarch-optimized "bulk middle" operations that only get called for large operations, don't have any alignment/edge-case logic, and that get called from the generic C function only past a threshold where they could help (and where call cost is tiny %).
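A minimal sketch of the "bulk middle" shape described above. This is not musl code: `sketch_memcpy`, `bulk_copy_middle`, `BULK_THRESHOLD` and `BULK_ALIGN` are assumed names and values, and the bulk routine here is just a `memcpy` stand-in for whatever an arch would actually provide. The generic C function keeps all the alignment and edge-case logic and only hands off an aligned, whole-block middle when the copy is large.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BULK_THRESHOLD 256 /* assumed cutoff; a real one would be measured */
#define BULK_ALIGN     32  /* assumed block size for the bulk routine */

/* Hypothetical arch hook. Contract: d is BULK_ALIGN-aligned and n is a
 * nonzero multiple of BULK_ALIGN, so it needs no edge-case handling. */
static void bulk_copy_middle(unsigned char *d, const unsigned char *s, size_t n)
{
    memcpy(d, s, n); /* stand-in for a uarch-optimized loop */
}

void *sketch_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (n >= BULK_THRESHOLD) {
        /* Head: byte copy until the destination is aligned. */
        size_t head = (size_t)(-(uintptr_t)d) & (BULK_ALIGN - 1);
        n -= head;
        while (head--)
            *d++ = *s++;

        /* Middle: whole aligned blocks go to the arch routine. */
        size_t middle = n & ~(size_t)(BULK_ALIGN - 1);
        bulk_copy_middle(d, s, middle);
        d += middle;
        s += middle;
        n -= middle;
    }

    /* Tail, or the entire copy when below the threshold. */
    while (n--)
        *d++ = *s++;
    return dest;
}
```

Small copies never reach the bulk call at all, so whatever the call into the arch routine costs is only paid where it is a tiny fraction of the total work.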
@robertdfrench I'd suggest structuring your benchmarks so the choice of implementation is made on the library side in all situations: currently it is done on the library side in some cases and on the benchmark program side in others.
Ideally the benchmark harness would be identical in each case, perhaps even letting you swap one library implementation for another.