Implement new CLZ and CTZ primops (re #9340)

This implements the new primops

clz#, clz32#, clz64#,
ctz#, ctz32#, ctz64#

which provide efficient implementations of the popular
count-leading-zeros and count-trailing-zeros operations, respectively
(see the testcase for a pure Haskell reference implementation).

On x86, both the NCG and the LLVM backend generate code based on the
BSF/BSR instructions (which need extra logic to make the 0-case well-defined).
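
To illustrate the kind of extra logic the 0-case needs, here is a minimal C sketch built on the GCC/Clang builtins rather than on the code the NCG actually emits. Like BSF/BSR, __builtin_clz and __builtin_ctz have an undefined result for a zero argument, so the wrapper handles zero explicitly and returns the word size. The helper names clz32_safe and ctz32_safe are hypothetical, and the sketch assumes unsigned int is 32 bits wide.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only.
 * Assumes a GCC/Clang-compatible compiler and a 32-bit unsigned int. */

/* Count leading zeros; returns 32 for x == 0, where the builtin
 * (like BSR) would give an undefined result. */
static inline unsigned clz32_safe(uint32_t x)
{
    return x == 0 ? 32u : (unsigned)__builtin_clz(x);
}

/* Count trailing zeros; returns 32 for x == 0, where the builtin
 * (like BSF) would give an undefined result. */
static inline unsigned ctz32_safe(uint32_t x)
{
    return x == 0 ? 32u : (unsigned)__builtin_ctz(x);
}
```

Returning the bit width for zero keeps the result total over all inputs, which is what a "well-defined 0-case" amounts to here.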

Test Plan

validate and successful tests on i686 and amd64

Does it also make sense to offer a 16-bit version or is there no corresponding instruction?

If there's a use for them, 16-bit variants can be offered easily as well. See to get an idea what's supported at the HW level.

For the __builtin_c{t,l}z() fallbacks I'll have to work around this a bit, as those aren't provided for types smaller than int, but the LLVM codegen happily supports generating code for 16-bit and 8-bit llvm.ct{l,t}z.*
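
One possible shape for such a workaround, sketched in C under stated assumptions: widen the narrow value to unsigned int, call the int-sized builtin, and for clz subtract the leading zeros that the widening added. This is not necessarily what the patch does. The helper names are hypothetical, and a GCC/Clang-compatible compiler is assumed.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only.
 * Assumes a GCC/Clang-compatible compiler. */

/* Number of bits in unsigned int, the type __builtin_clz operates on. */
#define UINT_BITS ((unsigned)(sizeof(unsigned int) * 8))

/* 16-bit clz: widening adds (UINT_BITS - 16) leading zeros, so subtract them. */
static inline unsigned clz16_fallback(uint16_t x)
{
    if (x == 0) return 16u;  /* builtin is undefined for 0 */
    return (unsigned)__builtin_clz((unsigned int)x) - (UINT_BITS - 16u);
}

/* 8-bit clz: same idea with (UINT_BITS - 8) extra leading zeros. */
static inline unsigned clz8_fallback(uint8_t x)
{
    if (x == 0) return 8u;
    return (unsigned)__builtin_clz((unsigned int)x) - (UINT_BITS - 8u);
}

/* ctz is unaffected by widening, so only the zero case needs care. */
static inline unsigned ctz16_fallback(uint16_t x)
{
    return x == 0 ? 16u : (unsigned)__builtin_ctz((unsigned int)x);
}

static inline unsigned ctz8_fallback(uint8_t x)
{
    return x == 0 ? 8u : (unsigned)__builtin_ctz((unsigned int)x);
}
```

Only clz needs the width correction, because zero-extension adds bits at the top of the word and leaves the low bits in place.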

Will you also add the relevant things to Data.Bits?


I think it makes sense to support all hardware-supported sizes. For example, unordered-containers happens to use 16-bit wide bitmasks. It doesn't need this particular instruction, but you could imagine a case where it was needed, and then it would be a shame not to have it.

Also implement 8-bit and 16-bit variants; make the unit test smaller/faster

I just tested this on Solaris 11.1 and bootstrapping went fine and gmake TEST=T9340 also passes. Good work!

