Convolution reverb does not use the time-reversed impulse; that would be reverse convolution reverb.
Most convolution reverbs use frequency domain techniques, since convolution in the time domain corresponds to multiplication in the frequency domain. To handle long impulses with low latency, the standard technique is sectioned overlap/add convolution. Many plugins use techniques similar to those in the Lake DSP low-latency convolution patent, where the impulse is divided into sections of unequal length, each section is convolved with the input separately, and the results are overlapped and added together.
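As a rough illustration of the overlap/add idea, here is a minimal Python sketch. It is not the Lake DSP scheme: it sections the input into equal-length blocks and convolves each block with the whole impulse, where a low-latency convolver would also partition the impulse into unequal sections. The function names, the block size, and the small recursive FFT are all assumptions made for the example, not anything from a particular plugin.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT. len(x) must be a power of two.
    The inverse is left unscaled; the caller divides by len(x)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def direct_convolve(x, h):
    """Brute-force time-domain convolution, used here only as a reference."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def overlap_add_convolve(x, h, block=4):
    """Convolve x with h one input block at a time, in the frequency domain."""
    # The FFT must be long enough to hold one block's full (linear) result.
    n_fft = 1
    while n_fft < block + len(h) - 1:
        n_fft *= 2
    # Transform the zero-padded impulse once and reuse it for every block.
    H = fft([complex(v) for v in h] + [0j] * (n_fft - len(h)))
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        X = fft([complex(v) for v in seg] + [0j] * (n_fft - len(seg)))
        # Multiplication in the frequency domain = convolution in time.
        Y = [a * b for a, b in zip(X, H)]
        seg_out = fft(Y, inverse=True)
        # Each block's tail overlaps the next block's output, so add it in.
        for i in range(len(seg) + len(h) - 1):
            y[start + i] += seg_out[i].real / n_fft
    return y
```

The result should match `direct_convolve(x, h)` to within floating point rounding. In this simplified form the latency is one input block, but each block still pays for an FFT sized to the whole impulse, which is the cost that partitioning the impulse is meant to avoid.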
A high-powered DSP would be able to perform convolution, no problem. The Tonecore development kit uses a Freescale DSP56364 running at 100 MHz, which would probably not be powerful enough. You would want to use a SHARC 21369, TI 6713/672x, or TigerSHARC to run a convolution reverb. All of these processors are optimized for running very fast FFTs, which is what you are looking for.
Another option: create your own hardware using an FPGA. You could implement a brute-force FIR-based convolver, a processor optimized for FFTs, or a combination of the two.
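To give a feel for what "brute force" means here, the following Python sketch models a direct FIR convolver and estimates its multiply-accumulate (MAC) rate. The class and function names are hypothetical, and the 2 second impulse at 48 kHz is just an assumed example figure, not a measurement of any of the hardware mentioned above.

```python
from collections import deque

class DirectFIR:
    """Sample-by-sample FIR convolver: one MAC per impulse tap per sample."""

    def __init__(self, impulse):
        self.h = list(impulse)
        # Delay line holding the most recent len(h) input samples, newest first.
        self.delay = deque([0.0] * len(self.h), maxlen=len(self.h))

    def process(self, sample):
        # Pushing on the left drops the oldest sample off the right.
        self.delay.appendleft(sample)
        return sum(hk * xk for hk, xk in zip(self.h, self.delay))

def macs_per_second(impulse_seconds, sample_rate):
    """MACs per second for one channel of direct convolution."""
    taps = int(round(impulse_seconds * sample_rate))
    return taps * sample_rate

# Assumed example: a 2 second impulse at 48 kHz is 96,000 taps,
# and every output sample needs one MAC per tap.
print(macs_per_second(2.0, 48000))  # 4608000000
```

Under those assumptions the direct form needs about 4.6 billion MACs per second per channel, which is why it only makes sense on massively parallel hardware such as an FPGA, or for the short early section of a hybrid FIR/FFT design.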
Personally, I think that convolution reverbs are a waste of computational resources, considering how much cheaper it usually is to run a Lexicon-style allpass loop on the same hardware. In some cases, it might be a toss-up as to which algorithm makes the most sense. Systems with fast FFTs and DMA-based memory access (that is, slower access to a random sample, but pretty quick at bringing in blocks of data once the desired start location has been reached) are good candidates for convolution reverb. My theory is that convolution reverbs are appealing because people feel that they can "sample" a space, and because the algorithm does not require the secret sauce of the professional algorithmic reverbs. However, a convolution reverb cannot emulate the time-varying characteristics of real-world acoustic spaces, uses variable amounts of resources for different impulse lengths (an algorithmic reverb uses the same CPU for a 0.1 sec T60 as for a 100 sec T60), and requires a huge amount of memory and MIPS compared to an algorithmic system.
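The point about T60 and CPU can be seen in even the simplest recirculating structure. The Python sketch below is a single feedback comb filter, not a Lexicon-style allpass loop, and the names are hypothetical. It assumes the usual relation between loop gain and decay time: the signal should have dropped by 60 dB after T60 seconds of trips around the delay.

```python
def comb_gain(delay_samples, sample_rate, t60_seconds):
    """Feedback gain that gives a 60 dB decay after t60_seconds."""
    # Each trip around the loop takes delay_samples / sample_rate seconds,
    # and -60 dB is a factor of 10**-3 in amplitude.
    return 10.0 ** (-3.0 * delay_samples / (sample_rate * t60_seconds))

class FeedbackComb:
    """y[n] = x[n] + gain * y[n - delay_samples]"""

    def __init__(self, delay_samples, gain):
        self.buf = [0.0] * delay_samples  # circular buffer of past outputs
        self.pos = 0
        self.gain = gain

    def process(self, sample):
        delayed = self.buf[self.pos]      # output from delay_samples ago
        out = sample + self.gain * delayed
        self.buf[self.pos] = out
        self.pos = (self.pos + 1) % len(self.buf)
        return out

# Same delay length, very different decay times: only the gain changes.
short_tail = FeedbackComb(1000, comb_gain(1000, 48000, 0.1))
long_tail = FeedbackComb(1000, comb_gain(1000, 48000, 100.0))
```

Both instances do one multiply, one add, and one buffer read and write per sample, and both use the same 1000-sample buffer. Changing the decay time changes a coefficient, not the amount of work or memory, whereas a convolution reverb's cost grows with the length of the impulse.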
Sean Costello