NPO2 slope span optimization
I don't understand slope drawing well enough to remove the modulo operations from the non-power-of-two (NPO2) slope span functions, so I optimized them with libdivide (https://libdivide.com/) instead. That library, contained in a single header file, speeds up division (and modulo) when the same divisor is reused many times. I also reduced the number of modulo operations per pixel from 2-4 to always 2. The functions are now 1.5x to 3x faster.
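For reviewers unfamiliar with the idea, here is a minimal, self-contained sketch of why a reused divisor makes modulo cheap. It is not libdivide's code and not the code in this PR: `fastmod_t`, `fastmod_gen`, `fastmod_do` and `fill_span` are hypothetical names, the coordinates are plain unsigned integers rather than the fixed-point values the real span functions use, and the trick shown is a generic multiply-and-shift remainder for 32-bit operands. The point is only that the expensive setup happens once per span, and each pixel then costs exactly two cheap modulo operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: a divisor plus a constant precomputed from it. */
typedef struct {
    uint64_t magic;   /* floor((2^64 - 1) / divisor) + 1, wraps to 0 for divisor 1 */
    uint32_t divisor;
} fastmod_t;

/* One real division here, done once per divisor (e.g. once per span). */
static fastmod_t fastmod_gen(uint32_t divisor)
{
    fastmod_t fm;
    assert(divisor != 0);
    fm.magic = UINT64_MAX / divisor + 1;
    fm.divisor = divisor;
    return fm;
}

/* Computes n % divisor using only multiplications and shifts. */
static uint32_t fastmod_do(uint32_t n, const fastmod_t *fm)
{
    /* Intentionally wraps modulo 2^64. */
    uint64_t lowbits = fm->magic * n;
    /* High 64 bits of the 96-bit product lowbits * divisor,
       assembled from two 32x32 -> 64 bit partial products. */
    uint64_t lo = (lowbits & 0xFFFFFFFFu) * fm->divisor;
    uint64_t hi = (lowbits >> 32) * fm->divisor;
    return (uint32_t)((hi + (lo >> 32)) >> 32);
}

/* Hypothetical span loop: samples a width x height texture with wrapping.
   Setup runs once; each pixel then needs exactly two modulo operations. */
static void fill_span(uint8_t *dest, const uint8_t *texture,
                      uint32_t width, uint32_t height,
                      uint32_t x, uint32_t y,
                      uint32_t xstep, uint32_t ystep, int count)
{
    const fastmod_t mw = fastmod_gen(width);
    const fastmod_t mh = fastmod_gen(height);

    while (count-- > 0) {
        *dest++ = texture[fastmod_do(y, &mh) * width + fastmod_do(x, &mw)];
        x += xstep;
        y += ystep;
    }
}
```

With libdivide itself, the setup and per-pixel steps would go through the library's own generated divider instead of these helpers; the structure of the loop is the part that carries over.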
Screenshots of the best-case improvement:
Before:
After: