As of this writing (July 2026), I've given Veo 3.1, Grok, Kling 3.0, Seedance 2.0 and a few others the same simple request: keep the camera still and only move the people. Nearly all of them acted like they hadn't heard me.
My original idea was to render two versions of every scene: a clean backplate and one with 3D people in it. Animate the people with video AI, then comp them back over the clean plate, keeping the architecture untouched. Spoiler: this is roughly where I ended up, but the road there taught me three things.
1. Video AI changes your original input far less than I assumed.
2. Video AI is expensive!
3. Video AI does not, I repeat, does not want to keep the camera absolutely fixed without some serious prompting.
That third issue is why I landed on Seedance 2.0. It wasn't the cheapest model I tested, but it was the only one that kept the camera locked with any degree of reliability.
A quick note on the different flavours you'll encounter: text-to-video, video-to-video, video upscaling, and the one we're using here, image-to-video. Whatever platform you choose, the workflow is similar: upload an input image, write a prompt, and set the duration, the resolution and whether you want generated sound.
Make sure your input image (your render) matches an aspect ratio the video AI supports. I went with 16:9. My input was 4K because I need that resolution later, but keep in mind that most video AIs max out at 1080p. If a platform advertises 4K output, that's usually a built-in upscale feature, not native 4K generation.
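If you want to sanity-check a render before uploading it, the aspect ratio test is easy to script. Below is a minimal Python sketch, not anything a platform provides: the `SUPPORTED_RATIOS` list, the function name `match_aspect_ratio` and the 1% tolerance are all my own assumptions for illustration, so swap in whatever ratios your platform actually documents.

```python
from fractions import Fraction

# Hypothetical list of ratios a video AI might accept.
# Check your platform's documentation for the real one.
SUPPORTED_RATIOS = {
    "16:9": Fraction(16, 9),
    "9:16": Fraction(9, 16),
    "4:3": Fraction(4, 3),
    "1:1": Fraction(1, 1),
}


def match_aspect_ratio(width, height, tolerance=0.01):
    """Return the label of the closest supported ratio, or None.

    A ratio counts as a match when the render's width/height is within
    `tolerance` (relative, 1% by default) of that ratio. The tolerance
    is an assumption, not a platform rule.
    """
    actual = Fraction(width, height)
    # Pick the supported ratio with the smallest absolute difference.
    label, target = min(
        SUPPORTED_RATIOS.items(), key=lambda item: abs(item[1] - actual)
    )
    if abs(target - actual) / target <= tolerance:
        return label
    return None


print(match_aspect_ratio(3840, 2160))  # 16:9
print(match_aspect_ratio(4096, 2160))  # None (DCI 4K is wider than 16:9)
```

The second call is the case worth catching: a 4096 x 2160 render looks like "4K" but is not 16:9, so it would need cropping or re-rendering first.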
For the record: I ended up using atlascloud.ai for video generation and gigapixelai.com for upscaling. Atlascloud can go straight to 4K, but doing the upscale at Gigapixel turned out to be considerably cheaper.