I've been heads-down on the GeminiOmni video tools for the last two weeks, and I want to write up what surprised me. Going in, I assumed Sora 2 would still be the model to beat. By the time I shipped /tools/text-to-video, I'd quietly swapped the default to Google's Veo 3.1 Fast — and not because of quality. Because of accounting.
This is a build note, not a benchmark suite. I'll explain how I made the call, what the math looked like, and what the trade-offs are if you want to follow the same path.
The 90-second version
If you're scanning:
- Veo 3.1 Fast costs $0.15 per second of generated video, billed by the second, published openly on Google's pricing page.
- Sora 2 is only available through ChatGPT Pro ($200/month) or as a $20/month ChatGPT Plus add-on with monthly generation caps. There is no honest per-second number you can put in a financial model.
- For an indie business, the second point matters more than the first. I can't underwrite a per-video price for my users when my own input cost is "however much $200/month works out to divided by however many clips I generated this month."
- Quality-wise, Sora 2 is still ahead on photorealistic faces and complex camera moves. Veo 3.1 Fast wins on synchronized native audio and is good enough on everything else for indie video work — product demos, social hooks, lifestyle B-roll.
How I actually ran the comparison
I had a list of 30 prompts that mirror the kind of thing GeminiOmni users will type. About a third were product-demo style ("a hand pours espresso into a clear glass mug, slow motion, kitchen lighting"). A third were social-media-hook style ("a 3-second zoom into a phone screen as someone double-taps an Instagram post"). The last third were lifestyle B-roll ("morning light through a Berlin balcony, ivy moving in the breeze").
I generated each prompt on:
- Veo 3.1 Fast via the Gemini Developer API at $0.15/sec
- Veo 3.1 Standard via the same API at $0.40/sec
- Sora 2 via my ChatGPT Pro account, generation time only — no fair per-clip cost available
I scored each on a private 1–5 scale across four dimensions: subject fidelity, motion plausibility, audio quality (where applicable), and prompt adherence.
I'm not publishing the per-prompt scores because they're noisy at n=30 and I don't want anyone to treat them as a benchmark. What I'll share is the pattern:
- Veo 3.1 Fast beat Sora 2 outright on audio, every single time. Synchronized footsteps, water sounds, ambient room tone — Veo's native audio is the killer feature, and Sora doesn't ship it at all.
- Sora 2 beat Veo 3.1 Fast on faces and lip movements, especially in tight close-ups. If your use case involves a person speaking directly to camera, the gap is real.
- Veo 3.1 Standard closed about half the face/lip gap for 2.7× the cost. I'm leaving it as a toggle on the Pro plan, not the default.
Inside the API loop, the cost picture was:
- 8-second clip on Veo 3.1 Fast: $1.20
- 8-second clip on Veo 3.1 Standard: $3.20
- 8-second clip on Sora 2: I genuinely can't tell you. I generated 30 clips that month and paid $200 flat; call it $6.67 a clip, but that number is an artifact of my usage, not a price.
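The metered numbers above are just rate times duration, but keeping them in code means the app's pricing math can't drift from the published rates. A minimal sketch (the rates are the published ones quoted above; the function name is mine):

```python
# Published per-second rates (USD) for the two Veo 3.1 tiers discussed above.
RATES = {
    "veo-3.1-fast": 0.15,
    "veo-3.1-standard": 0.40,
}

def clip_cost(model: str, seconds: float) -> float:
    """Metered cost of one generated clip, rounded to cents."""
    return round(RATES[model] * seconds, 2)

# The 8-second clips from the comparison:
print(clip_cost("veo-3.1-fast", 8))      # 1.2
print(clip_cost("veo-3.1-standard", 8))  # 3.2
```

The Standard/Fast ratio (0.40 / 0.15 ≈ 2.67) is where the "2.7× the cost" figure above comes from.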
Why per-second pricing matters more than you think
Here's the indie-business case for transparent per-call pricing, in two paragraphs.
If I want to offer a "Pro" plan at $19/month with a cap of 100 video clips, I need to know my unit economics on every clip. With Veo 3.1 Fast at $0.15/sec and an average 6-second output, my cost per clip is about $0.90. A heavy Pro user who hits the cap costs me $90, which is $71 more than they pay; that's workable only because the exposure is bounded, I can see it coming, and most Pro users land nowhere near the cap. The numbers are boring and they are right there on the pricing page.
If I tried to run the same business on Sora, the input cost would be a step function that depends on whether I was over or under whatever OpenAI's hidden quota is this month. I would be one quota change away from a unit-economics inversion. I cannot underwrite that. No indie can.
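The underwriting argument in these two paragraphs reduces to a worst-case-exposure calculation, and that calculation is only writable when the input price is an actual number. A sketch, using the plan parameters from the paragraph above:

```python
def worst_case_cost(cap_clips: int, avg_seconds: float, rate_per_sec: float) -> float:
    """Maximum metered spend one subscriber can cause in a month."""
    return round(cap_clips * avg_seconds * rate_per_sec, 2)

# Metered pricing: the exposure is knowable before anyone signs up.
exposure = worst_case_cost(cap_clips=100, avg_seconds=6, rate_per_sec=0.15)
print(exposure)  # 90.0

# Flat-rate pricing with a hidden quota: rate_per_sec is unknown and can
# change without notice, so this function cannot be written at all.
```

That last comment is the whole argument: not that Sora's effective cost is high, but that it is not a function you can evaluate.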
This isn't a knock on Sora quality. It's a knock on the business model. OpenAI is selling Sora to consumers through a flat-rate ChatGPT subscription. Google is selling Veo to developers through a metered API. The first model is great for ChatGPT users and unusable for downstream builders. The second is the opposite.
The audio thing is real
I want to spend one section on Veo 3.1's synchronized audio because I underestimated it.
When you generate "a person walks across a wooden floor, morning light, side view" on Veo 3.1, you get the picture and the footsteps in one model pass. Phase-aligned. Stereo. With the correct room reverb for the visible space. I assumed this would be a gimmick — "AI sound effects" — and it would be obviously wrong half the time.
It's not obviously wrong. It's so close to right that I keep forgetting it's generated. The footsteps land on the same frames as the visible foot impacts. The room reverb scales with the apparent room size. Ambient sound matches the scene (cafe murmur for cafe footage, wind for outdoor footage).
For social-media indie video, this completely removes the post-production audio step. You're not sourcing a stock audio library. You're not paying a sound designer for B-roll. You're not running a separate sync pass. The model did it, and it's good enough to ship.
Every "AI video" tool that doesn't have synchronized audio is now competing with one that does, for the same price. The ones without have a problem.
What I built on top
The default route in /tools/text-to-video is now Veo 3.1 Fast. The Pro plan has a toggle for Veo 3.1 Standard if a user needs the extra fidelity on close-up faces. I added a small server-side cap on Free-tier generations (5 per month, 720p, watermarked) because at $0.15/sec the math only works if Free users convert at a reasonable rate.
A few opinionated implementation choices, in case you're building something similar:
- I keep the user's prompt and the upscaled, model-name-stamped response on the server for 30 days on Pro accounts and discard both immediately on Free. Storing prompts is privacy-fraught and not worth the savings.
- I render the model name underneath every generated clip — literally a small "Veo 3.1 Fast" badge. People deserve to know what they paid for.
- I do not auto-renew credits. If you bought a 400-credit pack and didn't use it in 12 months, it expires. I'd rather lose a $29 sale than be the company that quietly drains your wallet.
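The no-auto-renew expiry rule is deliberately a pure function of the purchase date, which keeps it auditable and impossible to "quietly" extend into a drain. A sketch (function and constant names are mine; 12 months is approximated as 365 days here):

```python
from datetime import datetime, timedelta

CREDIT_LIFETIME = timedelta(days=365)  # credits expire 12 months after purchase

def usable_credits(purchased_at: datetime, balance: int, now: datetime) -> int:
    """Credits never auto-renew: past the lifetime, the balance is simply 0."""
    return balance if now - purchased_at <= CREDIT_LIFETIME else 0

bought = datetime(2025, 11, 1)
print(usable_credits(bought, 400, datetime(2026, 5, 1)))   # 400: still valid
print(usable_credits(bought, 400, datetime(2026, 12, 1)))  # 0: expired
```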
What I'd watch for at I/O
Two things to keep an eye on at Google I/O 2026 next week, both of which would change my defaults:
Gemini Omni. A new video model has leaked through the Gemini app UI — strings like "Create with Gemini Omni: meet our new video model, remix your videos, edit directly in chat" surfaced on May 5. If Google ships an Omni-class video model with the same API economics as Veo, the indie defaults shift again. I'll write that up the day it lands.
Sora 2 API. OpenAI has hinted at a metered Sora API. If it ships at a transparent per-second price below $0.30, the face-quality gap probably wins. If it ships gated behind enterprise sales or quota mysteries, indie builders should ignore it. I'll believe the metered API when I see the pricing page.
Read next
- Image to video on Veo 3.1 — reference frames for character consistency
- Text to video AI — current default + Pro toggle
- GeminiOmni pricing — how I picked $19 / $79
— Lena
