However, if I switched to a square wave (which has many more harmonics), the LFO would extend into the audio range... As I'm modulating delay times (and hence pitch), doesn't this cause audible FM distortion?
Of course. Isn't that kind of the point? It's not "distortion", it's "modulation": once the LFO's harmonics reach audio rate, you hear them as sidebands around the signal.

Also, a different (but related) question: I've read that this form of modulation can cause a tempo offset... How can this be solved?
If the delay time changes, then the "tempo" of the delay changes. It doesn't change the tempo of the incoming signal. This gives a strange effect. If you modulate a PT2399's delay time / pitch with a slow sine or triangle wave, it sounds like everything is speeding up and then slowing down again, but in fact the underlying tempo doesn't change. Just the pitch and tempo of the echoes.
It's not the same as the synth "FM versus PM" case, where one can cause a pitch offset and the other doesn't. The maths is not the same.
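To put rough numbers on the pitch wobble, here's a minimal sketch. It assumes an idealised delay line, y(t) = x(t - D(t)), for which the instantaneous pitch ratio of the echo is 1 - dD/dt. That's a simplification of what the PT2399 really does, and the LFO depth and rate below are made-up example values, not measurements.

```python
import math

def pitch_ratio_sine(t, depth_s, lfo_hz):
    """Pitch ratio of the echo at time t for an idealised delay line.

    Assumes D(t) = D0 + depth_s * sin(2*pi*lfo_hz*t), so the ratio is
    1 - dD/dt. D0 drops out because only the rate of change matters.
    """
    w = 2.0 * math.pi * lfo_hz
    return 1.0 - w * depth_s * math.cos(w * t)

def cents(ratio):
    """Convert a frequency ratio to cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(ratio)

# Example values (assumptions): 5 ms peak modulation depth, 1 Hz LFO.
DEPTH_S = 0.005
LFO_HZ = 1.0

# Sample exactly one LFO cycle.
ratios = [pitch_ratio_sine(i / 1000.0, DEPTH_S, LFO_HZ) for i in range(1000)]
mean_ratio = sum(ratios) / len(ratios)

print("highest pitch: %+.1f cents" % cents(max(ratios)))
print("lowest pitch:  %+.1f cents" % cents(min(ratios)))
print("mean ratio over one cycle: %.6f" % mean_ratio)
```

In this model the ratio swings above and below 1 but averages out to 1 over a full LFO cycle, which is why the echoes seem to speed up and slow down without anything actually drifting.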
Lastly, I've seen people on the internet (such as https://github.com/ElectricCanary/Bontempo) that can 'calibrate' the pedal but I can't figure out how this works... Can anybody explain it to me? 
Anything that *doesn't* use the pin 5 "clock output" on the PT2399 (or alternatively the audio input and output) isn't "calibrated" in any meaningful sense. It's an "open loop" solution, which is to say, it sets a value and hopes the chip responds according to the datasheet or its own internal assumptions, neither of which is guaranteed to hold.
In order to "close the loop" and actually calibrate the chip, you have to set a value and then measure the result. That can be done by setting the pin 6 value and then measuring the clock at pin 5. Or you can set the pin 6 value, and then ping a signal through the delay and measure how long it takes to arrive at the output. The pin 5 output on the PT2399 is a horrid waveform and needs cleaning up, but it's still easier than the second solution. The "calibration" comes when you adjust the values you put out based on the results that you saw. Without this, there is no calibration.
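As a rough illustration of "set a value, measure the result, adjust", here's a minimal sketch. Everything in it is hypothetical: `simulated_chip_delay_ms` is a made-up stand-in for real hardware (on a real pedal it would set the pin 6 control value and then time the cleaned-up pin 5 clock, or ping the audio path), and the 0..255 control range is just an assumption. The search also assumes the delay rises monotonically with the control value.

```python
def simulated_chip_delay_ms(control):
    """Made-up stand-in for the real chip: NOT the PT2399's actual response.

    On hardware this would set the control value, then measure the
    resulting delay (e.g. derived from the pin 5 clock).
    """
    return 30.0 + 1.2 * control + 0.002 * control ** 2

def calibrate(target_ms, measure, lo=0.0, hi=255.0, tol_ms=0.05, max_steps=40):
    """Closed-loop search for the control value that gives target_ms.

    Bisection: set a value, measure what the chip actually did, then
    move the search window towards the target. Assumes `measure` is
    monotonically increasing over [lo, hi].
    """
    for _ in range(max_steps):
        mid = (lo + hi) / 2.0
        measured = measure(mid)
        if abs(measured - target_ms) <= tol_ms:
            return mid
        if measured < target_ms:
            lo = mid   # delay too short: need a larger control value
        else:
            hi = mid   # delay too long: need a smaller control value
    return (lo + hi) / 2.0

control = calibrate(300.0, simulated_chip_delay_ms)
print("control value:", control)
print("measured delay (ms):", simulated_chip_delay_ms(control))
```

The point is that the loop never trusts a formula for the chip's response; it only trusts what it measures.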
The Bontempo link you posted puts *the human* in the loop, and gets you to adjust the delay until it matches the flashing tempo LED. It then stores the offset that was required for that delay value, so it can repeat it next time. It says that it does this ten times, so I would expect ten values spread across the full range of delay times it can handle. The offsets for in-between delay times can then be worked out by interpolation, which should be reasonably accurate. It's not the greatest, but if you're careful with your calibration, it should be ok. Still, it depends on *your* accuracy, not *its*, which seems like a design weakness to me.
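For what it's worth, the interpolation step could look something like the sketch below. This is my guess at the general idea, not taken from the Bontempo source: the ten (delay, offset) pairs are invented numbers, and `interpolate_offset` is a hypothetical helper using plain linear interpolation, clamped at both ends of the calibrated range.

```python
from bisect import bisect_right

# Ten made-up calibration points: (delay in ms, stored offset).
# Real values would come from the user's calibration run.
CAL_POINTS = [
    (50, 2.0), (100, 4.0), (150, 8.0), (200, 9.0), (250, 12.0),
    (300, 14.0), (350, 18.0), (400, 21.0), (450, 25.0), (500, 30.0),
]

def interpolate_offset(delay_ms, points):
    """Linearly interpolate the offset for delay_ms.

    `points` must be sorted by delay. Outside the calibrated range the
    nearest end point's offset is used (no extrapolation).
    """
    xs = [p[0] for p in points]
    if delay_ms <= xs[0]:
        return points[0][1]
    if delay_ms >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, delay_ms)   # first point strictly above delay_ms
    x0, y0 = points[i - 1]
    x1, y1 = points[i]
    return y0 + (y1 - y0) * (delay_ms - x0) / (x1 - x0)

print(interpolate_offset(125, CAL_POINTS))
print(interpolate_offset(480, CAL_POINTS))
```

How well this works depends on how smoothly the real offsets vary between the calibrated points, and of course on how carefully those points were set in the first place.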
HTH,
Tom