Thursday, December 15, 2022

Q_rsqrt

I stumbled on float Q_rsqrt(float) on some youtube or other. It is the reciprocal square root, done especially to normalise a vector to unit size (Pythagoras, yo!). This is much-beloved of the lighting and reflection code in such games as use the Quake III engine. Like “Alice”.

The mathematician in me agrees to use the word “reciprocal” over “inverse”. “Inverse” implies the inverse function, which for a root would be the, er, SQUARE. The insolation h which a planet at a AU gets from the Sun tends (as the Sun's disc recedes to a point) to the reciprocal-square, k/a²; k, at this blog for 1 AU, being 1361 or 1380 W/m² or whatever. We’re talking the reciprocal-root which, for this equation, would be the true inverse. It’s how we’d figure a: a = √(k/h) = √(1361/h) = 36.89/√h. So: “inverse” is meta, “reciprocal” is functional.
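
If you want that arithmetic made concrete, here's a toy C check of the reciprocal-root relation and nothing more; the quarter-flux planet is my own made-up input.

#include <stdio.h>
#include <math.h>

/* Toy check of the reciprocal-root relation above: given insolation h
   in W/m^2, recover the distance a in AU, taking 1361 W/m^2 at 1 AU. */
int main(void)
{
    double h = 1361.0 / 4.0;        /* a made-up planet getting a quarter the flux */
    double a = sqrt(1361.0 / h);    /* = 36.89... / sqrt(h) */
    printf("a = %f AU\n", a);       /* expect 2 AU */
    return 0;
}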

Getting /√ (or ^-.5 for us oldschool Microsoft nerds) is slow. I don’t care if I’m only doing this to get a one-off calculation done; but gamers care, because they’re trying to shoot a demon around a corner, not do a lot of trig. The Napier family would suggest we stuff memory full of natural logs and do a lookup. I understand that’s how computers generally did long division and various roots: you’d get arbitrarily good accuracy, dependent on how big a table that memory would hold. In the 1990s, power and memory both were lagging.

As to how the lookups were done: maybe a binary search, O(log n), to get a double-precision number into a less-precise bound for the lookup. More likely they just chopped it to something that could get mapped in a few hash iterations to Napier's table. This post doesn't much care.
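
For the curious, here is a minimal C sketch of that Napier route - my own toy, with a made-up table size and a crude chop-to-index standing in for whatever hashing the old machines really did:

#include <stdio.h>
#include <math.h>

/* Precompute a log table once, then answer 1/sqrt(x) by a lookup, a
   halving, and an exp.  Table size and range are arbitrary choices. */
#define TABLE_BITS 12
#define TABLE_SIZE (1 << TABLE_BITS)

static double log_table[TABLE_SIZE];

static void build_table(void)
{
    for (int k = 1; k < TABLE_SIZE; k++)
        log_table[k] = log((double)k / TABLE_SIZE * 16.0);  /* covers (0, 16) */
}

static double rsqrt_lookup(double x)        /* assumes 0 < x < 16 */
{
    int k = (int)(x / 16.0 * TABLE_SIZE);   /* chop x down to a table index */
    if (k < 1) k = 1;
    return exp(-0.5 * log_table[k]);        /* 1/sqrt(x) = e^(-ln(x)/2) */
}

int main(void)
{
    build_table();
    printf("lookup %f vs exact %f\n", rsqrt_lookup(2.0), 1.0 / sqrt(2.0));
    return 0;
}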

“Luckily” the graphical resolution of the day was also poor. Monitors were predominantly CRT anyway, so few cared about resolution; especially if frame-rates be quicken'd enow to skip past notice. Some programmers wondered if they could get all this done faster than a lookup into a memory-hogging table, with an acceptable trade in accuracy.

More luckily, it happens that a floating-point number stores its scale as a base-two exponent, and a square root just halves that exponent. We're on a computer. We're binary.

So here’s the basics, for that incoming four-byte signed floating-point number. You set a local “y” to that. You then pluck the binary out from the pointer, you C monster you:

long i = * ( long * ) &y;

Turns out that integer is, give or take a scale and an offset, log₂y. Then comes the literal WTF: i = 0x5f3759df - ( i >> 1 ). That >> is a binary "right-shift"; chopping by factors of two is what computers do. So the new i just represents WTF − (log₂y)/2, in that same scaled-and-offset space. As for that hexadecimal WTF from which the halved integer gets subtracted ... we'll get to that.
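
A quick sanity check of that "it's log₂y" claim, in C; the test value and the memcpy pun are mine, not id's:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* The bit pattern of a positive float, read as an integer, is roughly
   2^23 * (log2(y) + 127).  Undo the scale and offset and compare. */
int main(void)
{
    float y = 7.25f;                               /* arbitrary test value */
    uint32_t i;
    memcpy(&i, &y, sizeof i);                      /* the same pun, minus the UB */
    double log2_from_bits = i / 8388608.0 - 127.0; /* 8388608 = 2^23 */
    printf("log2(%g): exact %f, from bits %f\n", y, log2(y), log2_from_bits);
    return 0;
}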

After reversing the pun for the new y = * ( float * ) &i; the result turns out not to be perfectly accurate. Thence you run some newton-raphson and GTFO with it.
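
Putting those lines back in context, here is the routine as I recall it from the GPL'd Quake III source, original comments and all, with the second pass already parked behind a comment:

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}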

We somewhat cheated: we could have started with almost anything and done the newton-raphson. But the need was for speed. You’ll be iterating that last step as few times as possible; no for-loops or bounds-checking here. In the late 1990s, they did it once, with a second run commented out. Since then the algo has been put into instruction sets so that the PC can do this closer to the metal: rsqrtss. I wonder if System.Numerics’ Vector3 (float-based) is using it...
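
If you want to poke at that closer-metal route yourself, here's a small C sketch using the SSE intrinsic that wraps rsqrtss; the wrapper name and the single newton-raphson chaser are my choices, not anything from id or Microsoft:

#include <xmmintrin.h>   /* SSE intrinsics; needs an x86 built with SSE, e.g. gcc -msse */
#include <stdio.h>

/* Let the CPU's rsqrtss make the rough guess (about 12 bits' worth),
   then one newton-raphson pass to tighten it, much as Q_rsqrt does in software. */
static float rsqrt_sse(float x)
{
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return y * (1.5f - 0.5f * x * y * y);
}

int main(void)
{
    printf("%f\n", rsqrt_sse(4.0f));   /* expect something very near 0.5 */
    return 0;
}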

With our better monitors, which call for something closer to double precision, Toru Niina has reported some sketchiness in practice; and our memory and RAW POWER are better too, so Napier isn’t looking too bad anymore. The fast /√ may not have lived very long, mid-1970s to mid-2000s if we're generous; but - like the span of Pluto inside Neptune - that's most of my life. [WHY NOT 9/29/23: Compilers don't emit assembler anymore.]

Toru Niina hopes to salvage this lost spark of Clinton-era genius. His perhaps-obvious first pass was to implement that second newton-raphson, in Rust. But this slowed down the process such that it no longer competed with the old 1/sqrt(x).

Alternatively, given a limit of one newton-raphson spin, we continue that arcane art of picking the start-point. That’s a tweak to the WTF hexadecimal 0x5f3759df. Consensus runs that it be too low, with alternatives ranging from higher 0x5f375s even to low 0x5f376s. But never mind all that: why were id Software ever using a signed long and float? Square roots don't do negatives. Make them unsigned, double your accuracy. Post-Pentium, when "long" isn't Int32/4-byte but Int64/8-byte, a "double" of 8 bytes should be as fast as the 4-byte "float" once was.
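
To show what I mean, here's a sketch of that cleanup: unsigned 64-bit integer, 8-byte double, memcpy instead of the pointer pun, and the 64-bit magic constant usually quoted for IEEE doubles (0x5FE6EB50C7B537A9 - take that figure as received wisdom, I haven't re-derived it):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Sketch of a modernised Q_rsqrt: unsigned 64-bit pun over a double,
   one newton-raphson pass.  The magic constant is the one commonly
   quoted for doubles, not the original 0x5f3759df. */
static double q_rsqrt64(double x)
{
    uint64_t i;
    double y = x;
    memcpy(&i, &y, sizeof i);               /* the pun, without the aliasing sin */
    i = 0x5FE6EB50C7B537A9ULL - (i >> 1);
    memcpy(&y, &i, sizeof y);
    return y * (1.5 - 0.5 * x * y * y);     /* one newton-raphson pass */
}

int main(void)
{
    printf("%f vs %f\n", q_rsqrt64(2.0), 1.0 / sqrt(2.0));
    return 0;
}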
