Further improvements could be made if there's a way to have a union struct (4 u16s, lumped into one u64) with a simple 4-bit diagonal mirror operation on it. That could be at least twice as fast.
Benchmarking with optimizations shows about 80-90% of the prior time taken, so at least a 10% speedup.
This is rarely used, but it was fun to try to optimize a little more.
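A rough sketch of what such a lumped u64 operation might look like, assuming the diagonal mirror is a plain transpose of a 4x4 grid of nibbles and that row r sits in bits 16r..16r+15 of the u64 (both are assumptions, and `transpose_nibbles_u64` is a hypothetical name). It uses two delta swaps; whether it is actually faster is untested here.

```rust
/// Transposes a 4x4 grid of nibbles packed into a u64.
/// Assumed layout: nibble (row r, column c) occupies bits 16r+4c .. 16r+4c+3.
fn transpose_nibbles_u64(mut x: u64) -> u64 {
    // Step 1: inside each 2x2 block, swap the two off-diagonal nibbles.
    // (row even, col odd) and (row + 1, col - 1) are 12 bits apart.
    let t = ((x >> 12) ^ x) & 0x0000_F0F0_0000_F0F0;
    x ^= t ^ (t << 12);

    // Step 2: swap the two off-diagonal 2x2 blocks.
    // (row, col) and (row + 2, col - 2) are 24 bits apart.
    let t = ((x >> 24) ^ x) & 0x0000_0000_FF00_FF00;
    x ^= t ^ (t << 24);

    x
}
```

The operation is its own inverse, so applying it twice should return the original value.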
Eliminate bounds checks by accessing/setting the highest element first, and only index twice instead of 6 times.
Eliminate u16 casts by leaving the values as int (same result).
Eliminate temp value caching and write directly to storage instead (no more _0123).
End result looks neat too, since the terms with the >> 0's removed line up as a diagonal, like the nibble rotation :D
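For illustration only, here is a minimal Rust sketch of the shape described above, not the actual code from this change. It assumes the operation is a transpose of a 4x4 nibble grid stored as four u16 rows; `transpose_nibbles_rows` is a hypothetical name. Rust has no implicit integer promotion, so the cast point does not apply here and the values simply stay u16 throughout.

```rust
/// Transposes a 4x4 grid of nibbles stored as four u16 rows.
/// Assumed layout: column c of row r occupies bits 4c .. 4c+3 of rows[r].
/// Panics if the slice has fewer than four elements.
fn transpose_nibbles_rows(rows: &mut [u16]) {
    // Read the highest index first; the compiler can then typically
    // drop the bounds checks for indices 0..=2.
    let d = rows[3];
    let (a, b, c) = (rows[0], rows[1], rows[2]);

    // Write straight to storage, highest index first, no temporaries.
    // The terms that need no shift form a diagonal.
    rows[3] = ((a >> 12) & 0x000F) | ((b >> 8) & 0x00F0) | ((c >> 4) & 0x0F00) | ( d         & 0xF000);
    rows[2] = ((a >>  8) & 0x000F) | ((b >> 4) & 0x00F0) | ( c        & 0x0F00) | ((d <<  4) & 0xF000);
    rows[1] = ((a >>  4) & 0x000F) | ( b       & 0x00F0) | ((c << 4) & 0x0F00) | ((d <<  8) & 0xF000);
    rows[0] = ( a        & 0x000F) | ((b << 4) & 0x00F0) | ((c << 8) & 0x0F00) | ((d << 12) & 0xF000);
}
```

Taking a fixed-size `&mut [u16; 4]` instead of a slice would remove the bounds checks entirely; the slice form is kept only to mirror the highest-element trick.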