When dealing with embedded microcontrollers, you shouldn't be implementing filter topologies manually in your own code (assuming you're heavily loading the CPU and don't have tons of cycles to spare). You should be using the hardware acceleration libraries for the processor, which means you're limited to what they offer.
In the case of ARM Cortex-M parts, the M4F and M7 have DSP instructions and an FPU that the CMSIS-DSP library uses to accelerate filtering. For IIR filters, the most processing-efficient structures it supports are Direct Form I and Direct Form II Transposed cascaded biquads. If you implement your own filter topology in plain software, the performance will be terrible by comparison. The CMSIS filtering functions are documented here: https://www.keil.com/pack/doc/CMSIS/DSP/html/group__groupFilters.html
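For reference, here's a minimal sketch of how the float32 Direct Form I biquad cascade in CMSIS-DSP is typically set up and run. The coefficient values are placeholders (they'd come from your filter design tool), and note that CMSIS expects the feedback coefficients a1/a2 with their signs already flipped relative to the usual transfer-function notation.

```c
#include "arm_math.h"   /* CMSIS-DSP */

#define NUM_STAGES 2    /* two cascaded biquads = 4th-order IIR */
#define BLOCK_SIZE 64   /* samples processed per call */

/* 5 coefficients per stage: {b0, b1, b2, a1, a2}.
   Placeholder values only; CMSIS wants a1/a2 already negated. */
static const float32_t coeffs[5 * NUM_STAGES] = {
    1.0f, 0.0f, 0.0f, 0.0f, 0.0f,
    1.0f, 0.0f, 0.0f, 0.0f, 0.0f,
};

static float32_t state[4 * NUM_STAGES];   /* 4 state variables per DF1 stage */
static arm_biquad_casd_df1_inst_f32 iir;

void filter_init(void)
{
    arm_biquad_cascade_df1_init_f32(&iir, NUM_STAGES, coeffs, state);
}

void filter_block(const float32_t *in, float32_t *out)
{
    /* Uses the M4F/M7 FPU and DSP instructions under the hood */
    arm_biquad_cascade_df1_f32(&iir, in, out, BLOCK_SIZE);
}
```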
Regarding the 16 bits for transport, we're just going to have to agree to disagree. Dithering converts what would have been truncation error (potentially audible, though improbable anyway) into random noise; in other words, it raises the noise floor slightly. The noise floor can only be heard, if at all, during quiet or empty passages. As soon as there is actual sound content, you can't hear the noise anymore in an overall well-designed system. E.g. if the total SNR for the whole system, end to end, is better than 60 dB, you'll probably perceive it as a very high quality, silent system.
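To make the dithering point concrete, here's a minimal sketch (plain C, the names are mine) of TPDF dither applied when reducing float samples to a 16-bit transport word: the deterministic truncation error gets replaced with a small amount of uncorrelated noise.

```c
#include <stdint.h>
#include <stdlib.h>

/* Uniform random value in [-1.0, 1.0) LSBs; crude rand() used for brevity */
static float rand_lsb(void)
{
    return ((float)rand() / (float)RAND_MAX) * 2.0f - 1.0f;
}

/* Quantize a float sample in [-1.0, 1.0) to int16 with TPDF dither.
   Two uniform sources summed give the triangular PDF (+/-1 LSB peak). */
static int16_t to_q15_dithered(float x)
{
    float dither = 0.5f * (rand_lsb() + rand_lsb());
    float scaled = x * 32767.0f + dither;

    if (scaled >  32767.0f) scaled =  32767.0f;   /* saturate */
    if (scaled < -32768.0f) scaled = -32768.0f;
    return (int16_t)scaled;                       /* quantize after dither */
}
```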
Yes, noise adds up from every component in the system, but the vast majority of it comes from outside the digital audio path. In other words, the dithered noise accumulated across a dozen effects is going to be far less than the noise in the analog part of the system, which is subject to all kinds of EMI and interference. The accumulated noise in your 16-bit TRANSPORT paths is probably going to put the SNR somewhere between 70 dB and 80 dB. Your actual analog audio path as a whole is probably going to struggle to meet 60 dB SNR. Standard passive guitar pickups are brutal for SNR because of EMI; next to that, everything else almost doesn't matter.
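A back-of-envelope check on that accumulation claim (my numbers, assuming each dithered 16-bit stage contributes a noise floor of roughly -96 dBFS and the noise sources are uncorrelated so their powers add):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double per_stage_floor_db = -96.0;  /* ~16-bit dithered noise floor */
    const int    stages             = 12;     /* a dozen chained effects */

    /* Uncorrelated noise powers add linearly, so the combined floor
       rises by 10*log10(stages). */
    double combined_db = per_stage_floor_db + 10.0 * log10((double)stages);

    printf("combined noise floor: %.1f dBFS\n", combined_db);  /* about -85 dBFS */
    return 0;
}
```

Roughly -85 dBFS after twelve stages, which is still well below an analog path that struggles to reach 60 dB SNR.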
Again, as above, using higher precision in certain local math calculations is essential. Using 32-bit floats for the IIR coefficients is going to be more accurate than 64-bit fixed-point integers, but the float computation will be way slower: the 64-bit integer CMSIS-accelerated biquads are much faster than the 32-bit float accelerated ones. Using as many bits as you need, in the fastest computation possible, is what good DSP designers do. Slowing down the CPU and memory transfers by putting bits where they don't give you much return (like in the transport between effects) is not a good idea.
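That "wide locally, narrow in transport" pattern is cheap to express with the CMSIS-DSP conversion helpers. A hedged sketch (buffer sizes and names are mine): the effect receives and hands off 16-bit (q15) samples, but does its internal math in float.

```c
#include "arm_math.h"   /* CMSIS-DSP */

#define BLOCK_SIZE 64

/* 16-bit transport buffers shared between effects */
extern q15_t tx_in[BLOCK_SIZE];
extern q15_t tx_out[BLOCK_SIZE];

/* Scratch working buffer in float for the local math */
static float32_t work[BLOCK_SIZE];

extern arm_biquad_casd_df1_inst_f32 iir;   /* initialised elsewhere */

void effect_process(void)
{
    arm_q15_to_float(tx_in, work, BLOCK_SIZE);                 /* widen at the input  */
    arm_biquad_cascade_df1_f32(&iir, work, work, BLOCK_SIZE);  /* precise local math  */
    arm_float_to_q15(work, tx_out, BLOCK_SIZE);                /* narrow for transport
                                                                   (add dither first
                                                                   if you want it)    */
}
```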
Using 24-bit transport when you don't need to on a compute-bound processor means more RAM and more CPU cycles spent moving data around; everything gets slower. That potentially limits how complex your algorithms can be, and with embedded audio processing we always seem to need more compute. Would you rather have more accurate reverbs and distortion effects from more compute-intensive algorithms, or trade that for simpler, less advanced algorithms in order to save a few dB of SNR buried so low you can't hear it anyway?
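For a rough sense of the cost (my numbers; 24-bit samples usually end up stored in 32-bit words on these MCUs):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint32_t fs       = 48000;  /* sample rate, Hz */
    const uint32_t channels = 2;      /* stereo          */

    /* Sustained bandwidth per buffer copy, per second of audio */
    uint32_t bytes_16bit = fs * channels * (uint32_t)sizeof(int16_t);  /* 192000 B/s */
    uint32_t bytes_32bit = fs * channels * (uint32_t)sizeof(int32_t);  /* 384000 B/s, 24-bit padded to 32 */

    printf("16-bit transport:    %lu bytes/s\n", (unsigned long)bytes_16bit);
    printf("24/32-bit transport: %lu bytes/s\n", (unsigned long)bytes_32bit);
    return 0;
}
```

Every effect-to-effect hop pays that doubling again, in RAM footprint and in cycles spent copying, which is exactly the budget you'd rather spend on the algorithms themselves.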