I was reading through Arm64 instructions and saw that some instructions (e.g. CTZ) are optional and become mandatory with some extensions (e.g. Armv8.9 for CTZ).
I am now wondering about portability of using such instructions. If I am compiling some C code to use it on my machine, then the compiler could check what my hardware actually supports.
What if I want to pre-compile some binaries for a release of a program? Do compilers usually refrain from using too-recent extensions based on some statistical knowledge of CPUs and their implemented instruction sets? Or is there some secret compatibility trick for checking support at runtime that I am unaware of?
(For clarification, I know CTZ can be replaced by RBIT and CLZ and is probably not that big of a deal, I am just wondering about a more general case.)
Answer
Compilers use the ISA feature set you tell them they can assume for the target, e.g. -march=armv8.9-a, or an ISA level plus some specific features like -march=armv8.9-a+sve. For ARM / AArch64 (but not most other ISAs), GCC and Clang have a -mcpu option (e.g. -mcpu=cortex-a720) which implies every feature that core has and also tunes for it.
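To make that concrete, here is a minimal sketch of a trailing-zero count in C. The function name count_trailing_zeros is made up for this example; __builtin_ctzll is a real GCC/Clang builtin. The source is the same for every target, and the instructions you get depend on the -march / -mcpu option, assuming a compiler recent enough to know about the extension.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * __builtin_ctzll is undefined for 0, so that case is handled explicitly.
 *
 * Built for baseline AArch64 (e.g. -O2 -march=armv8-a), expect the
 * builtin to become rbit + clz. Built with -march=armv8.9-a, the
 * compiler is allowed to use a single ctz instead, and the resulting
 * binary then only runs on CPUs that have that instruction. */
unsigned count_trailing_zeros(uint64_t x)
{
    if (x == 0)
        return 64;
    return (unsigned)__builtin_ctzll(x);
}
```

Comparing the output of -S (or a disassembly) for the two -march settings is the easiest way to see what your particular compiler version actually does.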
For other ISAs like x86-64, -march takes CPU names. -march=znver4 for example implies all features of Zen 4 and -mtune=znver4. (More recently, there are x86-64 feature levels like -march=x86-64-v3 which is AVX2 + FMA + BMI1/2 + some more obscure stuff that's still widespread, like Haswell had. But not Intel-only things like TSX transactional memory.)
The default target config (if you don't use any options) is often quite old. It is either the earliest version of the ISA (like ARMv8.0 or first-gen x86-64) or, for 32-bit x86 for example, usually i686 (Pentium Pro) rather than 386, so cmov and the like are available. (More recently, distros often configure compilers to use SSE2 by default for 32-bit x86.)
If you want the compiler to check your hardware and make a binary that uses everything it has, use Clang or GCC -march=native. That is not the default.
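One way to see which baseline a given set of options gives you is to look at the compiler's predefined feature macros. The sketch below is only an illustration: describe_assumed_features is a hypothetical helper name, while __SSE2__, __AVX2__ and __ARM_FEATURE_SVE are macros that GCC and Clang define when the corresponding feature is enabled at compile time.

```c
#include <stddef.h>
#include <stdio.h>

/* Record what the compiler was told it may assume. These are
 * compile-time facts about the build, not about the CPU the
 * binary later runs on. */
#if defined(__aarch64__)
#define ARCH_NAME "aarch64"
#elif defined(__x86_64__)
#define ARCH_NAME "x86-64"
#elif defined(__i386__)
#define ARCH_NAME "i386"
#else
#define ARCH_NAME "other"
#endif

#if defined(__SSE2__)
#define ASSUME_SSE2 1
#else
#define ASSUME_SSE2 0
#endif

#if defined(__AVX2__)
#define ASSUME_AVX2 1
#else
#define ASSUME_AVX2 0
#endif

#if defined(__ARM_FEATURE_SVE)
#define ASSUME_SVE 1
#else
#define ASSUME_SVE 0
#endif

/* Hypothetical helper: formats the assumed baseline into buf and
 * returns what snprintf returns (the length it wanted to write). */
int describe_assumed_features(char *buf, size_t size)
{
    return snprintf(buf, size, "arch=%s sse2=%d avx2=%d sve=%d",
                    ARCH_NAME, ASSUME_SSE2, ASSUME_AVX2, ASSUME_SVE);
}
```

Building this once with no options and once with -march=native (or a specific -march / -mcpu) and printing the string shows the difference between the default baseline and what you asked for.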
It's also possible to write code that checks features at run-time and dispatches to different versions of a function, e.g. to take advantage of CPUs with different SIMD features, or in your example a scalar loop that's slightly faster with ctz instead of rbit/clz. This has overhead and of course can't inline, so wrapping just a single ctz would defeat the purpose; you need to multiversion a function that contains a loop.
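The run-time check itself is OS- and ISA-specific. As a hedged sketch, assuming Linux with glibc on AArch64 and GCC or Clang on x86: getauxval and __builtin_cpu_supports are real interfaces, but cpu_has_wanted_feature is a name invented for this example, and HWCAP2_CSSC is only defined by sufficiently new headers, so the code falls back to "not available" when it is missing.

```c
#include <stdbool.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h> /* getauxval, AT_HWCAP2, and the HWCAP2_* bits on glibc */
#endif

/* Hypothetical helper: reports whether the CPU we are running on has
 * the optional feature we would like to use. */
bool cpu_has_wanted_feature(void)
{
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP2_CSSC)
    /* The kernel reports optional AArch64 features as hwcap bits.
     * CSSC is the extension that contains the scalar ctz. */
    return (getauxval(AT_HWCAP2) & HWCAP2_CSSC) != 0;
#elif (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* GCC and Clang wrap CPUID behind this builtin on x86. */
    return __builtin_cpu_supports("avx2") != 0;
#else
    /* Unknown platform or old headers: assume the feature is absent. */
    return false;
#endif
}
```

Returning false when in doubt is the safe choice here: the worst case is that the program uses the slower baseline path on hardware that could have run the faster one.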
This can be done fully manually with an array of function pointers you init at program startup, or with some help from the compiler like GCC ifunc stuff where you use __attribute__((target("whatever"))) on different definitions of the same function.
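The fully manual variant might look roughly like this. It is a sketch under assumptions, not a recommended library design: the names sum_i32, sum_i32_init and the two implementations are hypothetical, a single function pointer stands in for the array, and only an x86-64 AVX2 path is shown, with every other target using the generic version.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline version: compiled for whatever -march the whole file uses. */
static int64_t sum_i32_generic(const int32_t *v, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Same loop, but the compiler may use AVX2 inside this one function. */
__attribute__((target("avx2")))
static int64_t sum_i32_avx2(const int32_t *v, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}
#endif

/* Starts at the safe default, so calls made before init still work. */
static int64_t (*sum_i32_impl)(const int32_t *, size_t) = sum_i32_generic;

/* Call once at program startup to pick the best available version. */
void sum_i32_init(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2"))
        sum_i32_impl = sum_i32_avx2;
#endif
}

/* Public entry point: one indirect call per array, not per element. */
int64_t sum_i32(const int32_t *v, size_t n)
{
    return sum_i32_impl(v, n);
}
```

The cost of the indirect call is paid once per call to sum_i32 and amortised over the whole loop, which is why dispatching at the level of a loop makes sense and dispatching a single instruction does not.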
This answer is primarily about GCC and Clang; other compilers will have different names for their options, but the basics are generally similar.