1M token context isn't free — the real per-page cost of Gemini PDF chat

May 6, 2026

When I started building /tools/pdf-chat, the pitch wrote itself: Gemini 2.5 Flash has a one-million-token context window. You can drop in a 1,500-page PDF and chat with the whole thing in one prompt. No chunking, no RAG, no embedding pipeline, no retrieval errors. Magic.

Magic, when you do the math, is somewhat expensive.

This post is the inside-baseball look at what a million-token context actually costs to run, how that shaped GeminiOmni's pricing, and when traditional chunked RAG is still the right answer. If you're building anything that touches long documents, this is the trade-off you need to internalize before you commit to a tier structure.

The headline number

Gemini 2.5 Flash, currently the cheapest mainstream 1M-context model, is priced at $0.30 per million input tokens and $2.50 per million output tokens. That's the lowest published 1M-context pricing I'm aware of, and it's still 8× cheaper than Gemini 2.5 Pro at the same context size.

A million tokens is a lot of tokens. Specifically:

  • A million tokens ≈ 750,000 English words, ≈ 1,500 prose pages, or ≈ 3,000 dense academic-paper pages
  • For a typical scanned PDF (where the model has to OCR images), expect closer to 800–1,000 pages per million tokens because the images consume token budget themselves

Now the multiplication that nobody wants to do in public:

  • Sending a 200-page PDF (≈130k tokens) once costs $0.039. That's about four cents. Cheap.
  • Sending a 1,500-page PDF (≈980k tokens) once costs $0.294. Thirty cents.
  • Asking ten follow-up questions, where each one re-sends the full PDF for context, costs $2.94 for the 1,500-page document. Three dollars.
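The multiplication above can be packed into a five-line calculator. I'm assuming roughly 650 tokens per prose page, which is my rounding of the post's figures (200 pages ≈ 130k tokens, 1,500 pages ≈ 980k tokens), so the outputs land within a fraction of a cent of the numbers quoted:

```python
# Back-of-envelope cost of sending a PDF as Gemini 2.5 Flash input.
# $0.30 per 1M input tokens is the published Flash rate; 650 tokens/page
# is my own rounding, consistent with 200 pages ≈ 130k tokens.

INPUT_PRICE_PER_M = 0.30   # USD per million input tokens (Flash)
TOKENS_PER_PAGE = 650      # rough prose-PDF token density (assumption)

def send_cost(pages: int, turns: int = 1) -> float:
    """Cost of re-sending the full document on every turn, no caching."""
    tokens = pages * TOKENS_PER_PAGE
    return turns * tokens * INPUT_PRICE_PER_M / 1_000_000

print(f"200-page PDF, 1 question:   ${send_cost(200):.3f}")      # ~$0.039
print(f"1,500-page PDF, 1 question: ${send_cost(1500):.3f}")     # ~$0.293
print(f"1,500-page PDF, 10 turns:   ${send_cost(1500, 10):.2f}") # ~$2.93
```

Output-token cost is deliberately ignored here: answers are a few hundred tokens, so even at $2.50/1M they add well under a cent per turn.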

That last number is the one that broke my pricing model.

Why follow-up questions re-send the whole PDF

This is the part nobody writing breathless 1M-context blog posts mentions. Gemini's context window is per call, not per session. There is no persistent server-side state holding your document for the next message.

When you ask the second question in a chat session, the client has to send the full PDF again, plus the new question, plus the model's previous answer (for continuity). Every turn costs roughly the same as the first. Out of the box, there is no caching discount.

Google has shipped a "context caching" feature that helps here — you can cache a long input for up to an hour and pay a lower per-call rate for re-using it — but the cache itself isn't free either. The cache storage cost runs around $1 per million tokens per hour for Flash. For a 1,500-page document, that's about $1/hour just to keep it warm.

Three real-world scenarios fall out of this:

  1. One-shot questions on a fresh document. $0.04–$0.30 per question. Genuinely cheap. Worth it.
  2. Ten-question chat session on a fresh document, no caching. $0.40–$3.00. Still acceptable for a paid feature, brutal for a free one.
  3. One-hour deep-dive with 30+ questions on a 1,500-page document. With caching enabled: ~$1 to keep it warm + $0.05/turn = roughly $2.50 total. Without caching: $9. The caching matters once you cross a session-length threshold.
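The "session-length threshold" in scenario 3 is easy to compute. Using the rough per-turn figures from the scenarios above ($1/hour of cache storage, ~$0.05 per cached turn, ~$0.294 per uncached full re-send; all assumptions from this post, not official pricing):

```python
# Break-even turn count for context caching on a 1,500-page document.
# All constants are the post's back-of-envelope figures, not official rates.

CACHE_STORAGE_PER_HOUR = 1.00   # ~$1/hour to keep ~1M tokens cached (Flash)
COST_PER_TURN_CACHED = 0.05     # per-turn cost once the doc is cached
COST_PER_TURN_UNCACHED = 0.294  # full ~980k-token re-send each turn

def session_cost(turns: int, hours: float = 1.0, cached: bool = True) -> float:
    if cached:
        return hours * CACHE_STORAGE_PER_HOUR + turns * COST_PER_TURN_CACHED
    return turns * COST_PER_TURN_UNCACHED

# First turn count where a one-hour cached session beats re-sending.
breakeven = next(n for n in range(1, 100)
                 if session_cost(n, cached=True) < session_cost(n, cached=False))
print(breakeven)                           # → 5
print(round(session_cost(30), 2))          # → 2.5  (scenario 3, cached)
print(round(session_cost(30, cached=False), 2))  # → 8.82 (scenario 3, ~$9)
```

So under these assumptions, caching pays for itself from the fifth question onward; below that, the $1 storage fee eats the discount.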

When chunked RAG is actually cheaper

This is the heretical part: for some workloads, the old-school chunk + embed + retrieve pipeline is still cheaper than 1M context, by a lot.

Specifically, if you have a stable document corpus (not a fresh upload each time) and many users asking many questions, RAG wins:

  • Embedding the corpus: Gemini Embedding 2 at $0.20/1M tokens — embed a 1,500-page document once for $0.20, store the vectors forever.
  • Per question: retrieve the top 5 chunks (~5k tokens), send them to Gemini 2.5 Flash. Cost per question: about $0.0015. That's a tenth of a cent.
  • Versus 1M context: $0.30 per question if you re-send the whole doc.

For a customer-support bot on a fixed knowledge base, RAG is 200× cheaper. For a research assistant reading a fresh paper the user just uploaded, RAG's embedding overhead doesn't amortize and 1M context wins.
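The amortization argument is just two line equations. A sketch using the post's numbers ($0.20 one-time embedding, ~$0.0015 per RAG question, ~$0.30 per full-context question):

```python
# Amortization of chunked RAG's one-time embedding cost vs. re-sending
# the full document. Constants are the post's figures for a 1,500-page doc.

EMBED_ONCE = 0.20          # one-time cost to embed the whole corpus
RAG_PER_QUESTION = 0.0015  # top-5 chunks (~5k tokens) into Flash
FULL_PER_QUESTION = 0.30   # whole document re-sent every question

def rag_total(questions: int) -> float:
    return EMBED_ONCE + questions * RAG_PER_QUESTION

def full_total(questions: int) -> float:
    return questions * FULL_PER_QUESTION

# Per-question ratio behind the "200x cheaper" claim:
print(FULL_PER_QUESTION / RAG_PER_QUESTION)   # → 200.0

# Support-bot scale: one corpus, a thousand questions across all users.
print(round(rag_total(1000), 2))   # → 1.7
print(round(full_total(1000), 2))  # → 300.0
```

At support-bot scale the embedding cost vanishes into noise; for a one-off upload the win is marginal in dollars, and RAG's pipeline complexity and chunk-boundary risk are what tip the decision toward 1M context.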

The decision rule:

  • User uploads a one-off document, asks 1–10 questions → 1M context (no embedding setup; all retrieval handled by the model)
  • Static knowledge base, many users, many questions → RAG (amortize embedding cost across all queries)
  • User uploads multiple documents and wants cross-document reasoning → 1M context (RAG retrieval misses cross-document references)
  • Compliance / legal review on a single dense doc → 1M context (RAG can miss critical clauses through chunk boundaries)

GeminiOmni's PDF chat is firmly in the "one-off upload, 1–10 questions" lane — that's why we use 1M context unconditionally.
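The decision rule above collapses into a tiny routing function. This is my own illustration of the table's logic, not GeminiOmni's actual code:

```python
# Toy router implementing the workload decision rule from the table.
# Parameter names and structure are illustrative, not production code.

def choose_strategy(one_off_upload: bool, many_users: bool,
                    cross_document: bool, high_stakes_review: bool) -> str:
    """Return 'full-context' (1M window) or 'rag' (chunk + embed + retrieve)."""
    if cross_document or high_stakes_review:
        return "full-context"  # chunking can sever cross-refs or clauses
    if one_off_upload:
        return "full-context"  # embedding overhead never amortizes
    if many_users:
        return "rag"           # amortize one embedding pass over all queries
    return "full-context"

print(choose_strategy(one_off_upload=True, many_users=False,
                      cross_document=False, high_stakes_review=False))
# → full-context
```

Note the ordering: correctness concerns (cross-document reasoning, legal review) override the cost argument, which is why the two "RAG is riskier" rows win even when RAG would be cheaper.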

How this shaped the pricing tier

Once I'd internalized the per-page math, the tier structure for the PDF tool fell out almost mechanically:

Free tier: under 200 pages, unlimited questions. A 200-page PDF costs ~$0.04 per question. Even a heavy free user asking 100 questions in a month costs me $4. That's a sustainable acquisition cost — I can underwrite it against the conversion rate from free to Pro.

Pro tier: $9/month, unlimited document length, unlimited questions. Anchor case: a user uploads three 1,500-page documents per month, asks 30 questions on each = 90 questions × ~$0.30 = $27. That would invert the unit economics. Two things save it:

  1. Most Pro users don't actually push to the limit. The 90th-percentile heavy user costs me about $4/month. The mean is closer to $0.80.
  2. Context caching cuts the per-question cost on long sessions by roughly 4×.
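Those two facts turn the scary $27 anchor case into workable margins. A quick sanity check using the figures above (the usage distribution is from my own logs as described, not something you should treat as an industry benchmark):

```python
# Unit-economics check for the $9/mo Pro tier, using the post's figures.
# MEAN_COST and P90_COST are this product's observed numbers, illustrative only.

PRICE = 9.00
MEAN_COST = 0.80     # average Pro user's monthly inference cost
P90_COST = 4.00      # 90th-percentile heavy user
WORST_CASE = 27.00   # 90 questions x ~$0.30 on 1,500-page docs, uncached
CACHE_FACTOR = 4     # rough cost reduction from context caching

print(f"margin on the mean user: ${PRICE - MEAN_COST:.2f}")   # → $8.20
print(f"margin on the p90 user:  ${PRICE - P90_COST:.2f}")    # → $5.00
print(f"worst case with caching: ${WORST_CASE / CACHE_FACTOR:.2f}")  # → $6.75
```

Even the theoretical worst-case user drops back under the $9 price once caching kicks in, which is what makes "unlimited" survivable at all.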

I'm okay running a thin margin on the heaviest Pro users because they're the loudest about how well the product works, and they convert teammates. Trying to extract maximum margin from the top 10% would mean ruining the product for them.

Team tier: $79/month for 3 seats with API access. This is where the real margin lives. Teams pay 4× as much as a Pro user for the same usage profile, because they're paying for collaboration, API access, and SLA — not raw inference. I'm comfortable with this because it mirrors how Notion, Linear, and every other team-tier SaaS works.

What I'd do differently if I were building today

A few opinions I'd be brave enough to put on the page only after two months of running the math:

Surface the per-question cost to the user. When a user uploads a 1,500-page document, the UI should say "each question on this document costs about 1 credit." Hiding the cost from users creates moral hazard — they ask 50 questions when 5 would do, and I eat the margin. Transparent pricing self-regulates this.

Default to caching on documents over 500 pages. The 4× cost reduction on long sessions is large enough that "enable caching" should not be a power-user toggle. It should be the default behavior, with a "disable cache for privacy" opt-out for the few users who care.

Don't promise "unlimited" anything. I made the mistake of writing "PDF chat unlimited pages" on the Pro plan. That's now load-bearing copy, and I have to make the unit economics work even for the heaviest user. Future me would have written "up to 5,000 pages per document, fair-use cap on monthly volume." Lesson logged for the next product.

— Lena

Lena Hoffmann
