What we know about Gemini Omni one week before Google I/O

May 12, 2026

I'm writing this on May 12, 2026 — exactly seven days before Google I/O. If the predictions in this post age badly by May 20, that's fine; I'd rather be on record. If they age well, even better.

What I want to do here is walk through what's actually known about "Gemini Omni" right now, separate the signal from the speculation, and explain what I'm betting on for /tools/text-to-video and the rest of the GeminiOmni stack when the announcement lands.

The string in the app

The most concrete piece of evidence dropped on May 5, when 9to5Google's APK teardown found a new resource in the Gemini Android app:

Create with Gemini Omni: meet our new video model, remix your videos, edit directly in chat, try templates, and more.

A few things jumped out at me on the first read:

  1. "Our new video model" — singular. Not "video models," not "video and audio." Google's framing positions this as the next-generation video offering, even if (as I'll argue below) the model itself is probably more general than that string implies.
  2. "Remix your videos, edit directly in chat" — this is the chat-native editing motion that Nano Banana 2 already does for images. Google is extending the conversational-editing UX to video.
  3. "Try templates" — Google is shipping a templates surface inside the Gemini app. This is a hint about distribution, not about the model itself. Indie developers should assume their video tools are now competing with one-tap templates inside the consumer Gemini app.

Testing Catalog ran a follow-up showing sample outputs surfaced in the app's debug builds. The aesthetic is closer to Veo 3.1 than to anything I've seen from Imagen, and the templates appear to include some image-to-video flows.

Why I think "Omni" is bigger than the leak suggests

The name "Omni" is doing a lot of work. Google does not call models "Omni" lightly. The naming choice tells me three things:

One model, not three. The pattern Google has been laying down — Nano Banana for in-context image editing, Veo for video, Gemini Flash for everything else — has been three specialized models behind one chat UI. "Omni" reads to me as the unification of that: video and audio in one pass, image and video in one pass, possibly even text generation co-located with media generation in a single forward call.

The video framing is the consumer pitch. Marketing teams put the most legible capability in the announcement string. "New video model" is easier to explain at a keynote than "unified multimodal generative model with native audio and chat-native editing." But the developer surface is probably broader than the keynote suggests.

Omni Reference is the actual feature. If you've used Veo 3.1 with reference frames, you know how powerful the "lock this subject across the entire clip" UX is. The capability the leak doesn't name, but that "remix your videos" all but implies, is a multi-reference Omni mode: hand it a subject, a style image, and a script, and let one model produce a coherent piece of footage.
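
To make that concrete, here's the request shape I'd guess at, as a TypeScript interface. Every field name here is my own invention, extrapolated from Veo 3.1's reference-frame API; nothing in the leak confirms any of it.

```typescript
// Pure speculation: a guess at a multi-reference "Omni mode" request,
// extrapolated from Veo 3.1's reference frames. No field name here comes
// from the leak or from any Google documentation.
interface OmniRemixRequest {
  subjectImages: string[]; // URIs locking a subject, like Veo 3.1's 1-4 reference frames
  styleImage?: string;     // optional look/grade reference
  script: string;          // narration and shot direction in a single prompt
}
```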

What Veo 3.1 already does

To calibrate expectations: anything Veo 3.1 Fast already does is the floor for what Omni ships with. Today, on Veo 3.1 Fast at $0.15 per second:

  • 8-second 1080p video with native synchronized spatial audio
  • Reference-frame–locked character consistency (1–4 input frames)
  • Physics-aware motion for cloth, water, hair, and refraction
  • Text-to-video, image-to-video, and short-form video-to-video remixing
  • Chat-style edits ("slow this down at 0:03") on previously generated clips

This is already an absurd amount of capability for the price. Whatever Omni ships will be measured against this baseline. If Omni doesn't include a working version of every one of those features, indie developers will quietly stay on Veo 3.1.
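
For calibration, the text-to-video call today looks roughly like this through the @google/genai JS SDK. The kick-off-then-poll pattern is the documented one for video generation; the model ID string is my placeholder, so check the current docs for the real Veo 3.1 Fast identifier.

```typescript
// Sketch of today's baseline: text-to-video on Veo via the @google/genai
// JS SDK. The model ID below is a placeholder, not a verified identifier.
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function generateClip(prompt: string): Promise<string | undefined> {
  // Video generation is long-running: start an operation, then poll it.
  let operation = await ai.models.generateVideos({
    model: "veo-3.1-fast", // placeholder ID
    prompt,
  });
  while (!operation.done) {
    await new Promise((resolve) => setTimeout(resolve, 10_000)); // poll every 10s
    operation = await ai.operations.getVideosOperation({ operation });
  }
  return operation.response?.generatedVideos?.[0]?.video?.uri;
}

// e.g. await generateClip("a slow dolly shot through a rain-soaked market")
```

Whatever Omni's API looks like, this is the call shape it has to beat.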

Three predictions I'm willing to put my name to

I've been wrong about Google I/O announcements before, so take these with the appropriate dose of salt.

Prediction 1: Omni's headline feature is multi-modal-in-one-prompt. You'll be able to type a single Gemini prompt and get back a structured object containing video, image stills, narration audio, and maybe a transcript — all generated jointly so they're internally consistent. This is the obvious next step from "video and audio in one pass" to "everything in one pass."
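
If that lands, here's the kind of response object I'd expect to be parsing. To be clear, this interface is entirely hypothetical; I'm sketching the shape of "everything in one pass", not quoting a schema.

```typescript
// Hypothetical response shape for a single "everything in one pass" call.
// Every field name is my guess, not a leaked or documented schema.
interface OmniGeneration {
  video: { uri: string; durationSeconds: number };
  stills: { uri: string; timestampSeconds: number }[]; // keyframes pulled from the video
  narration: { uri: string; mimeType: string };        // audio generated jointly with the video
  transcript?: string;                                 // the "maybe" part of the prediction
}
```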

Prediction 2: Pricing comes in below the sum of the parts. Today, generating a 30-second narrated explainer video means at least three API calls: Veo for the video, TTS for the narration, Imagen for thumbnail images. If Omni bundles these, Google will price it at something like 70% of the sum. That's the wedge that pulls indie developers off the multi-call workflow.
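
Here's the back-of-envelope version. Only the $0.15/second Veo rate comes from public pricing (see above); the TTS and Imagen figures are placeholders I made up to show the shape of the calculation.

```typescript
// Back-of-envelope bundle math. Only the Veo rate ($0.15/s) is the real
// figure from above; the TTS and Imagen numbers are illustrative placeholders.
const veoVideo = 30 * 0.15;    // 30s of Veo 3.1 Fast = $4.50
const ttsNarration = 0.3;      // placeholder
const imagenThumbnails = 0.12; // placeholder: a few stills
const sumOfParts = veoVideo + ttsNarration + imagenThumbnails; // $4.92

// Prediction 2: Omni prices the bundle at ~70% of the separate calls.
const omniGuess = 0.7 * sumOfParts; // ≈ $3.44
```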

Prediction 3: There won't be a usable API on day one. Google has a history of announcing models at I/O and shipping API access weeks later. I expect a controlled-access launch where the consumer Gemini app gets Omni immediately, AI Studio gets it within a week, and the public Developer API gets it 2–4 weeks after I/O. Indie builders should plan for a gap.

What I'm shipping on the assumption Omni lands as expected

Three concrete bets I've made in the GeminiOmni codebase based on this read:

A "Coming May 20" placeholder on the homepage. The hero on geminiomni-ai.com mentions Omni Mode explicitly. I'd rather pre-stake the keyword and look mildly silly if the launch slips than scramble to update copy at 3am Pacific on launch day.

Server-side prompt routing. The four tools — text-to-video, image-to-video, Nano Banana edit, PDF chat — all route through a single internal endpoint with a model parameter. When Omni's developer API opens, I change one config value and the same user-facing tool starts using the new model. No client changes required. (There's a sketch of this routing layer after these three bets.)

No standalone "Omni" UI yet. I'm not building a fifth tool tile labeled "Omni" until I've seen the actual API and decided what the right UX wrapper is. If Omni's signature feature is "one prompt, four media types," the right UI might be a single chat box rather than four separate tools — and that's a bigger product decision than I want to commit to before launch.
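
Here's the routing layer in miniature. The endpoint path, the model IDs, and callModel are simplified placeholders rather than literal source; the point is that the model choice lives in one env-driven map.

```typescript
// Minimal sketch of the config-driven routing described above. Model IDs
// and the endpoint path are placeholders; the real point is that swapping
// in Omni later means changing one env value, not touching any client.
import express from "express";

const MODEL_MAP: Record<string, string> = {
  "text-to-video": process.env.TTV_MODEL ?? "veo-3.1-fast",
  "image-to-video": process.env.ITV_MODEL ?? "veo-3.1-fast",
  "image-edit": process.env.EDIT_MODEL ?? "nano-banana-2",
  "pdf-chat": process.env.PDF_MODEL ?? "gemini-flash",
};

const app = express();
app.use(express.json());

app.post("/api/generate", async (req, res) => {
  const { tool, prompt } = req.body as { tool: string; prompt: string };
  const model = MODEL_MAP[tool];
  if (!model) {
    res.status(400).json({ error: `unknown tool: ${tool}` });
    return;
  }
  // callModel wraps the actual Gemini SDK call; stubbed here because only
  // the routing matters for this sketch.
  res.json(await callModel(model, prompt));
});

async function callModel(model: string, prompt: string): Promise<unknown> {
  return { model, prompt }; // stub
}

app.listen(3000);
```

If Prediction 3 holds and the developer API slips, nothing here changes; the tools simply stay on their current defaults until the env values flip.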

What I'm watching during the keynote

If you're tuning in to I/O live on May 19–20, here are the three things that will most affect what indie builders should ship next:

  1. Pricing. A per-second number for Omni video output. If it's at or below Veo 3.1 Fast, the math holds. If it's 3–5× higher, Omni is a consumer feature, not a developer surface.
  2. API ETA. Whether Sundar Pichai or Demis Hassabis says "available today in the Gemini API" or "rolling out to developers in the coming weeks." Two very different timelines.
  3. Free-tier rate limits in AI Studio. Whatever Google sets here becomes the de facto ceiling for what indie tools can offer on their own Free plans.

I'll write this up again on May 21 with what actually happened. Until then, I'd love to be wrong about any of this.

— Lena

Lena Hoffmann
