This is great, but some context: this is for specific inner loops, and the 94x is relative to the C version of that same inner loop. On a machine with AVX-512, what was typically used before this was the AVX2 version of the inner loop, and the speedup over that version appears to be up to 60%. And since any given inner loop isn't running all the time, the overall speedup is probably much less than 60%. That is still sizeable, but the actual speedup in practice with this implementation is far from 94x.
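To put rough numbers on that, here's a quick Amdahl's law sketch. The fractions of runtime spent in the loop are made up for illustration (I don't have profiling data); only the 1.6x comes from the "up to 60%" figure above.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when only `fraction`
    of the original runtime is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # Hypothetical shares of total runtime spent in the optimized inner loop.
    for fraction in (0.1, 0.3, 0.5):
        vs_avx2 = overall_speedup(fraction, 1.6)   # "up to 60%" over AVX2
        vs_c = overall_speedup(fraction, 94.0)     # headline figure vs. C
        print(f"{fraction:.0%} of runtime in loop: "
              f"{vs_avx2:.2f}x overall vs AVX2, {vs_c:.2f}x overall vs C")
```

Whatever the real fraction is, the overall gain is capped at 1 / (1 - fraction), so even a 94x faster loop can't make the whole program 2x faster if that loop is only half of the runtime.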
All the better for us running Linux!