Since audio can also be represented as an image by transforming it into a spectrogram (pictured above), a diffusion model (like Stable Diffusion↗︎ or Midjourney↗︎) can be trained on a set of spectrograms generated from a directory of audio files. The trained model is then used to synthesize similar spectrograms, which are converted back into audio.
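The round trip between waveform and spectrogram is the key step of this pipeline. Below is a minimal sketch using librosa (an assumed dependency here, with placeholder file names); the diffusion model would sit in the middle, trained on and sampling arrays like `mel`:

```python
import librosa
import soundfile as sf

# Load an audio file (placeholder path) and resample to a fixed rate.
y, sr = librosa.load("loop.wav", sr=22050)

# Forward transform: waveform -> mel spectrogram (the "image" the model sees).
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128
)

# A diffusion model would be trained on (and later sample) arrays like `mel`.
# Here we simply invert the original spectrogram to demonstrate the reverse step.

# Inverse transform: mel spectrogram -> waveform via Griffin-Lim phase estimation.
y_hat = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=2048, hop_length=512
)

sf.write("reconstructed.wav", y_hat, sr)
```

Note that `mel_to_audio` estimates the missing phase with Griffin-Lim, so the reconstruction is lossy; production pipelines often swap in a neural vocoder for this last step.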
You can play around with some pre-trained models on Google Colab or Hugging Face Spaces. Check out some automatically generated loops here.