Those tiny time delays are going to put some notches and peaks into the frequency response, especially at higher frequencies (where the wavelengths are shorter). Subtle movement of your head would move those peaks and notches, giving a more "alive" sound - basically a mild phaser gently applied to your signal. By the time you're boogieing about in front of the speakers, the phase shifts will be pretty significant!
Since this effect can only occur when you've got multiple sound sources producing the same signal, it's no surprise that a 4x12 sounds better (more interesting?) than a 1x12.
This interference has a fairly strong effect on the response.
Even if you consider only the following things, you get some pretty extreme-looking frequency responses:
- the phase shift (time delay) due to the different distances
- the slightly different levels due to the different distances
- the reduction in high frequencies due to the off-axis response of the speaker
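To put rough numbers on the first two items, here is a minimal sketch for just two drivers. The 343 m/s speed of sound, the function names and the example geometry (1.7m ear height in particular) are my own assumptions for illustration. It treats the drivers as ideal point sources and ignores the off-axis roll-off entirely.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly room temperature


def path_difference(src_a, src_b, listener):
    """Difference in distance (metres) from two sources to the listener."""
    return abs(math.dist(src_a, listener) - math.dist(src_b, listener))


def notch_frequencies(delta_m, f_max=10000.0):
    """Notch frequencies (Hz) up to f_max for two equal sources.

    Cancellation happens when the path difference is an odd number of
    half wavelengths, i.e. at odd multiples of c / (2 * delta).
    """
    if delta_m <= 0:
        return []  # equal path lengths: no comb filtering
    first = SPEED_OF_SOUND / (2.0 * delta_m)
    notches = []
    k = 0
    while (2 * k + 1) * first <= f_max:
        notches.append((2 * k + 1) * first)
        k += 1
    return notches


def level_difference_db(src_a, src_b, listener):
    """Level difference (dB) between the two arrivals, assuming 1/r spreading."""
    r_a = math.dist(src_a, listener)
    r_b = math.dist(src_b, listener)
    return 20.0 * math.log10(max(r_a, r_b) / min(r_a, r_b))


# Hypothetical example: two drivers stacked 350mm apart, listener 3m away.
# Coordinates are (distance from cab, height) in metres; ear height is assumed.
low_driver = (0.0, 0.16)
high_driver = (0.0, 0.51)
ear = (3.0, 1.7)

delta = path_difference(low_driver, high_driver, ear)
print("path difference (m):", round(delta, 4))
print("notches below 10 kHz (Hz):", [round(f) for f in notch_frequencies(delta)])
print("level difference (dB):", round(level_difference_db(low_driver, high_driver, ear), 2))
```

Move the `ear` position a few centimetres and the notch list shifts, which is the "mild phaser" effect described above. With real drivers the notches won't be infinitely deep, because the two arrivals differ in level and in off-axis response.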
There are some nice pics of off-axis responses here:
http://rutcho.com/speaker_drivers/supravox_t215_rtf_64/supravox_t215_rtf_64.html
This is for a 21cm speaker. Imagine shifting the curves down in frequency by a factor of 21/30 = 0.7 for a 12" driver. If you are standing close to the speakers, the angle from your ears to the driver axis can be fairly significant, especially for the drivers down low near the ground. You can see from those curves that the high frequencies are strongly attenuated off-axis. So in an array, not many of the speakers are contributing to the high frequencies. When you are further from the speakers the angle becomes smaller and you get more highs.
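As a rough illustration of how quickly that angle changes with distance, here is a small sketch. The ear height of 1.7m and the driver height of 0.16m are assumed example values, the function names are made up, and the driver axis is taken to be horizontal (cab not tilted).

```python
import math


def off_axis_angle_deg(driver_height_m, ear_height_m, distance_m):
    """Angle (degrees) between a horizontal driver axis and the line to the ear."""
    rise = abs(ear_height_m - driver_height_m)
    return math.degrees(math.atan2(rise, distance_m))


def scale_frequency(freq_hz, measured_diameter_cm=21.0, target_diameter_cm=30.0):
    """Shift a frequency read off the 21cm curves to its rough 12" (30cm) equivalent."""
    return freq_hz * measured_diameter_cm / target_diameter_cm


# Assumed example: bottom driver centre 0.16m off the floor, ear at 1.7m.
for distance in (1.0, 3.0, 6.0):
    angle = off_axis_angle_deg(0.16, 1.7, distance)
    print(f"{distance:.0f}m away: about {angle:.0f} degrees off-axis")

# A feature at 5 kHz on the 21cm plots would sit near this frequency for a 12":
print(round(scale_frequency(5000.0)), "Hz")
```

The diameter scaling is only a rule of thumb: it assumes the directivity depends purely on cone size relative to wavelength, which real cones only roughly follow.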
One caveat: off-axis mic'ing will give a very different response from these distant off-axis plots. The mic is in really close, in the near-field, and the response there is quite different from the far-field response.
Of course in a real room the situation is more complex. You also have the response of your ear to consider. Your ears will enhance the high frequencies for sounds approaching from the sides, which undoes the speaker roll-off a little bit.
I looked at this stuff about 20 years ago. I've got a model which sums up the response based on the items listed above. It includes a model for the off-axis response as well. You basically specify the positions of the drivers and the listening position and it sums up the pressure responses. It assumes a flat-response driver, so you have to multiply the result by the response of the actual driver. I could also add a reflection from the floor. The result was a very unintuitive-looking response, with *much* larger changes to the response than you would expect. If I get time I'll pull this stuff out and post a plot.
This is a set-up with all drivers spaced 350mm between centers. The listening position is 3m away from the speakers, with the ear at around head height and a fraction off centre in the left/right sense. The speakers are on the ground (first driver center about 160mm from the base) and build up higher and wider as more drivers are added.

What's not included: no floor or wall reflections, no modelling of diffraction loss from the enclosure, no ear response mods.
There is a hack in the off-axis response to prevent the attenuation going too low.
The responses are normalized around 800Hz.
Just to emphasize: the modelled driver has a flat response. The response of a real guitar speaker would add (in dB) to what is shown. So the rise you see in a guitar speaker's response at high frequencies would raise the level of the highs.
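For anyone who wants to play with the idea, here is a stripped-down sketch of this kind of summation. It is not my original model: it treats each driver as a flat point source with 1/r level drop and a pure time delay, and it leaves out the off-axis roll-off (and the hack that goes with it). The 343 m/s speed of sound, the 1.7m ear height, the 0.1m sideways offset and all the function names are assumptions for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def summed_pressure(drivers, listener, freq_hz):
    """Complex sum of the pressures from flat point-source drivers.

    Each driver contributes exp(-j*k*r) / r, where r is its distance to
    the listener: a delay (phase shift) plus a 1/r level drop.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND  # wavenumber
    total = 0j
    for pos in drivers:
        r = math.dist(pos, listener)
        total += cmath.exp(-1j * k * r) / r
    return total


def response_db(drivers, listener, freqs_hz, ref_hz=800.0):
    """Magnitude response in dB, normalized to 0 dB at ref_hz."""
    floor = 1e-12  # avoids log(0) at a perfect cancellation
    ref = max(abs(summed_pressure(drivers, listener, ref_hz)), floor)
    return [
        20.0 * math.log10(max(abs(summed_pressure(drivers, listener, f)), floor) / ref)
        for f in freqs_hz
    ]


# Assumed 2x2 layout, coordinates (x = toward listener, y = left/right, z = height):
# 350mm between centers, bottom row 160mm above the floor.
cab_2x2 = [
    (0.0, -0.175, 0.16), (0.0, 0.175, 0.16),
    (0.0, -0.175, 0.51), (0.0, 0.175, 0.51),
]
ear = (3.0, 0.1, 1.7)  # 3m away, slightly off centre, assumed ear height

freqs = [200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0]
for f, level in zip(freqs, response_db(cab_2x2, ear, freqs)):
    print(f"{f:6.0f} Hz: {level:+6.1f} dB")
```

A single driver comes out dead flat with this sketch, as it should. To get closer to the plots you would multiply each driver's term by an off-axis attenuation that depends on angle and frequency, then add the real driver's response in dB on top.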