I suspect these are using additional tools to guide the AI beyond a simple prompt. For example, the spiraling medieval village was generated with Stable Diffusion and ControlNet.
I think the prompt is not much more than "puppies" and "kittens". Major, middle, and minor features of an image can be controlled individually in some models (they can be separated with a Fourier transform or Gaussian convolutions and fed into different discriminators; see the band-splitting sketch after this list), so I'd guess:
major features (scenery) are controlled by the prompt (grass or couch)
middle features (text) are a source image that the AI is punished for straying from
minor features (details) are controlled by the prompt (faces and fur)
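To make the "separate frequency bands" idea concrete, here's a minimal sketch of splitting an image into coarse/middle/fine bands using differences of Gaussian blurs. The sigma values and the three-way split are my own illustrative choices, not anything the actual generator is known to use:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def frequency_bands(img, sigma_low=16.0, sigma_mid=4.0):
    """Split a grayscale image (2D array) into three frequency bands.
    low  ~ major features (scenery), mid ~ text-scale shapes,
    high ~ fine detail (fur, faces). Sigmas are illustrative."""
    img = img.astype(np.float32)
    blur_low = gaussian_filter(img, sigma_low)  # heavy blur: coarse structure only
    blur_mid = gaussian_filter(img, sigma_mid)  # lighter blur: coarse + middle
    low = blur_low
    mid = blur_mid - blur_low                   # band between the two blur scales
    high = img - blur_mid                       # everything finer than sigma_mid
    return low, mid, high                       # note: low + mid + high == img
```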
Or it's just Stable Diffusion run in img2img mode, starting from the text image rather than random noise.
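For what that alternative would look like, here's a minimal img2img sketch with Hugging Face diffusers, starting denoising from the text image instead of pure noise. The model ID, prompt, and strength are placeholder choices, not the settings behind the actual image:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Placeholder base model; any SD 1.5-compatible checkpoint works here.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("text.png").convert("RGB").resize((512, 512))

# strength controls how far the result may drift from the init image:
# lower keeps the text more legible, higher gives more natural puppies.
out = pipe("a pile of puppies on grass", image=init, strength=0.6).images[0]
out.save("puppies_with_hidden_text.png")
```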
Not sure what this prompt was, since I didn't make this one. I linked the site I used above; they're pretty simple to do, but it takes a few tries to get a good one.
Stable Diffusion together with ControlNet. You basically feed it the text as a black-and-white image and give it a description of a picture of cats. It then generates the output while using the black-and-white image as a base. It's fairly simple to do, but it can take a while to get a result as clean as this one.
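Roughly what that workflow looks like with Hugging Face diffusers. The ControlNet checkpoint here (monster-labs' "QR code monster", often used for these hidden-image effects), the prompt, and the conditioning scale are my assumptions, since the poster didn't say which they used:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Assumed checkpoint: a ControlNet commonly used for hidden-text/QR effects.
controlnet = ControlNetModel.from_pretrained(
    "monster-labs/control_v1p_sd15_qrcode_monster", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The black-and-white text image acts as the conditioning signal.
text_mask = Image.open("text_mask.png").convert("RGB").resize((512, 512))

image = pipe(
    "a pile of cute kittens on a couch, soft fur, photorealistic",
    image=text_mask,
    num_inference_steps=30,
    # Higher scale keeps the text more readable; lower gives more natural cats.
    controlnet_conditioning_scale=1.3,
).images[0]
image.save("kittens_hidden_text.png")
```

Dialing `controlnet_conditioning_scale` up or down is usually where the "few tries to get a quality result" time goes: too high and the text dominates, too low and it disappears.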
I'm able to read it easily if I squint really hard, to the point that my eyes are nearly closed. Alternatively, it's clear as day when I just look at the thumbnail.
I had a hard time with this too; I didn't even know there were supposed to be words. What finally did it was squinting, so much that my eyes were nearly closed and I could just see a little between my eyelashes. Then it stood out clear as day.