
An evaluation of the Gemini 2.5 flash and Claude 3.7 Sonnet model for an agentic engine.
I had a simple rule when choosing an LLM for Ozigi: don't pick based on benchmark leaderboards. After my v2 launch, in recieving feedback, a user suggested I use the Claude models as they were better for content generation than Gemini. While the suggestion sounded tempting, I had to pick a model based on the four constraints my production pipeline couldn't negotiate around.
Most "Gemini vs Claude" comparisons evaluate general-purpose capabilities like coding, reasoning, and creative writing. That's useful if you're building a general-purpose product.
I wasn't.
Ozigi is a content engine. You feed it a URL, a PDF, or raw notes. It returns a structured 3-day social media campaign as a JSON payload that the frontend maps directly into UI cards.
That specificity made the evaluation easier than I expected: Two models, Four constraints. One clear winner on three of the constraints.
This is the third post in the Ozigi Changelog Series. If you want the backstory on why Ozigi exists, start with how I vibe-coded the internal tool that became it, and the v2 changelog that introduced the modular architecture this decision was built on.
Here's the full Architecture Decision Record.
The core API route in Ozigi does this:
multipart/form-data payload containing a URL, raw text, and/or a file (PDF or image)The frontend then does this:
const parsed = JSON.parse(responseText);
setCampaign(parsed.campaign);
No middleware. No schema validation. No error recovery in the happy path. Raw parse, straight into React state.
That single line is why model selection mattered.
The requirement: The model must return a valid JSON object — every time, without wrapping it in markdown code fences, without adding a conversational preamble, and without hallucinating a trailing comma that breaks JSON.parse().
The target schema looks like this:
{
"campaign": [
{ "day": 1, "x": "...", "linkedin": "...", "discord": "..." },
{ "day": 2, "x": "...", "linkedin": "...", "discord": "..." },
{ "day": 3, "x": "...", "linkedin": "...", "discord": "..." }
]
}
It renders nine posts across three platforms in a span of three days, with every field required.
The UI renders each field into a separate card with edit, copy, and publish actions. A missing key doesn't throw a visible error — it silently renders an empty card.
This comparison is specifically between Gemini with responseSchema enforcement and Claude with prompted JSON, not between each model's structural output ceiling. Claude's tool use with tool_choice: {type: "tool"} enforces schema at the decoding layer and can reach equivalent reliability. The relevant constraint here was which enforcement mechanism was available and practical within my existing stack. More on that below.
I ran 500 automated test generations against both models targeting this schema, measuring the percentage of responses that JSON.parse() accepted without exceptions.
| Model | Format Adherence Rate |
|---|---|
| Gemini 2.5 Flash | 99.9% |
| Claude 3.7 Sonnet (prompted) | ~88.5% |
The 11.5% gap maps directly to broken UI states for real users. That was not acceptable to me for a core feature.
Using Gemini's responseSchema closes this entirely. According to Google's controlled generation documentation, the feature physically prevents the model from returning output that doesn't conform to your schema. It's not prompt-level guidance, it's enforced at the decoding layer. Here's what the production implementation looks like for Ozigi: the schema is defined once at the top of the route and attached directly to the model config:
const distributionSchema = {
type: "OBJECT" as const,
properties: {
campaign: {
type: "ARRAY" as const,
description: "A list of 3 daily social media posts.",
items: {
type: "OBJECT" as const,
properties: {
day: { type: "INTEGER" as const, description: "Day number (1, 2, or 3)" },
x: { type: "STRING" as const, description: "Content for X/Twitter." },
linkedin: { type: "STRING" as const, description: "Content for LinkedIn." },
discord: { type: "STRING" as const, description: "Content for Discord." },
},
required: ["day", "x", "linkedin", "discord"],
},
},
},
required: ["campaign"],
};
const model = vertex_ai.getGenerativeModel({
model: "gemini-2.5-flash",
generationConfig: {
responseMimeType: "application/json",
responseSchema: distributionSchema,
},
});
response.text() is now structurally guaranteed to be valid JSON. JSON.parse() cannot fail on a missing field, trailing comma, or conversational preamble — the model is physically prevented from producing them.
Claude's tool use and function calling can achieve similar guarantees, but it requires a meaningfully different integration architecture. With the Vertex SDK, this is one config block.
Winner: Gemini.
The requirement: Ozigi has a free, unauthenticated sandbox. Anyone can generate a full 3-day campaign without signing up.
That changes the economics of model selection completely. A paying user on a premium plan will tolerate a 20-second wait if the output quality justifies it. An anonymous user who found the product via my whacky marketing efforts will not. They'll close the tab at 10 seconds and probably not come back, sadly.
I benchmarked both models against a standard 10,000-token input payload via Vercel serverless functions (my production environment):
| Model | Avg Response Time |
|---|---|
| Gemini 2.5 Flash | ~6.2s |
| Claude 3.7 Sonnet | ~21.5s |
Methodology: N=100 requests per model, measured end-to-end from Vercel function invocation to full response. Results are environment-dependent and intended for directional comparison, not as absolute benchmarks.
The gap holds across payload sizes. Gemini Flash consistently comes in under 10-15 seconds. Claude 3.7 Sonnet consistently exceeds 20 seconds on the same inputs, in the same environment.
This gap would narrow significantly with streaming: getting first tokens in front of the user within 2-3 seconds. Streaming changes the perceived wait time for a user entirely. This is, however, a v4 architecture item that is being worked on. For a non-streaming pipeline with a public sandbox, the 3.5x latency difference is a product decision, not just an engineering one.
Winner: Gemini Flash — and it's not close for non-streaming public sandboxes.
The requirement: Users can upload PDFs and images directly as context. The pipeline needs to process them without an external preprocessing step.
With Gemini via the Vertex AI Node.js SDK, the entire PDF pipeline is:
// /app/api/generate/route.ts
if (file && file.size > 0) {
const arrayBuffer = await file.arrayBuffer();
const base64Data = Buffer.from(arrayBuffer).toString("base64");
parts.push({
inlineData: {
data: base64Data,
mimeType: file.type, // "application/pdf", "image/jpeg", etc.
},
});
}
const result = await model.generateContent({
contents: [{ role: "user", parts: parts }],
});
You can see that the SDK handles the buffer natively. Gemini reads the PDF directly as part of the multipart request alongside the text prompt — no OCR step, no preprocessing, no separate service call. Google's multimodal documentation confirms that Gemini was designed from the ground up to handle PDF and image buffers natively via inlineData.
An earlier version of this article claimed that Claude required an external OCR step for PDF ingestion. That was wrong. Claude's Messages API does support native base64 PDF ingestion directly via a document content block — no OCR preprocessing, no external service. The pattern is structurally similar to Vertex AI's inlineData, just different field names.
The real constraint here was ecosystem, not capability. I evaluated Claude 3.7 Sonnet as available in the Google Model Garden within my existing Vertex AI setup. Switching to Claude's native PDF ingestion would have meant moving to the Anthropic Messages API entirely — a different provider, different SDK, different billing. The Vertex AI path was simpler for the stack I was already running.
Winner: Gemini — for this stack. Both models support native multimodal ingestion without external OCR. The advantage here was ecosystem fit, not a fundamental capability difference.
The requirement: Generated social media posts must sound like a human wrote them. Specifically, they must pass AI content detection and avoid the predictable cadence patterns that make AI-generated copy immediately identifiable.
This is the constraint where Claude wins cleanly on base performance.
Our internal blind A/B evaluations of 50 technical posts (scored on pragmatic sentence structure and absence of AI terminology) gave Claude 3.7 Sonnet a "human cadence quality score" of 9.5/10. Gemini Flash's base score was 5.5/10.
That's a significant gap. And it's for the feature that is Ozigi's core value proposition.
Because the gap is engineerable.
We built the Banned Lexicon — a programmatic constraint injected at the system prompt level that explicitly penalizes the vocabulary patterns that make AI copy detectable. You can read the full implementation in the Ozigi documentation:
THE BANNED LEXICON: You are strictly forbidden from using the
following words or their variations: delve, testament, tapestry,
crucial, vital, landscape, realm, unlock, supercharge, revolutionize,
paradigm, seamlessly, navigate, robust, cutting-edge, game-changer.
Combined with explicit cadence engineering:
BURSTINESS (CADENCE): Write with high burstiness. Do not use
perfectly balanced, medium-length sentences. Mix extremely short,
punchy sentences (2-4 words) with longer, detailed explanations.
PERPLEXITY: Avoid predictable adjectives. Use strong, active verbs
and concrete nouns. Talk like a pragmatic subject matter expert
explaining a concept to people, not a marketer selling a product.
FORMATTING RESTRAINT: You are limited to a MAXIMUM of 1 emoji per
post. Use a maximum of 2 highly relevant hashtags per post.
With these constraints active, Gemini's human cadence score jumps from 5.5 to 9.2 — within acceptable range of Claude's base 9.5.
The key insight: Claude's tone advantage is a default advantage, not an absolute one. Gemini's outputs are more malleable under prompt constraints. For a use case where tone control is the entire product, that malleability is worth more than a higher baseline.
Winner: Gemini + engineering constraints. The tone gap is closeable. The latency and JSON stability gaps on the other constraints are not.
At this stage where Ozigi is a public sandbox, every anonymous page load that can trigger a generation is a billable API call absorbed by the product. Ozigi is at its pre-revenue stage, so this matters a lot.
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|
| Gemini 2.5 Flash | ~$0.075 | ~$0.30 |
| Claude 3.7 Sonnet | ~$3.00 | ~$15.00 |
Pricing sourced from Google Cloud Vertex AI pricing and Anthropic API pricing.
Pro tip:Verify current rates before production decisions — both have changed multiple times in the past year.
The input cost difference is 40x. The output cost difference is 50x. For a free-tier product with no revenue, the ability to run a public sandbox sustainably is the difference between having a conversion funnel and not having one.
This is an honest ADR. Here's what would change my answer.
When Ozigi finally moves behind a paywall, latency and cost become secondary concerns. A signed-in user on a paid plan is more likely waiting 20 seconds for premium output is a different UX calculation than an anonymous user on a free demo. In that context, Claude's base tone quality becomes much more compelling. I'd be trading economics for output baseline, and the trade might be worth it.
When streaming gets implemented, the latency argument against Claude weakens significantly. Claude 3.7 Sonnet's time-to-first-token via streaming is competitive. A user seeing the first post appear in 2-3 seconds experiences the product very differently than a user staring at a progress bar for 21 seconds. Streaming is on the roadmap.
For an in-depth look at how we tested the pipeline that informs these decisions, see how we E2E test AI agents with Playwright in Next.js.
| Constraint | Gemini 2.5 Flash | Claude 3.7 Sonnet | Winner |
|---|---|---|---|
| JSON Stability (responseSchema) | 99.9% → guaranteed | ~88.5% (prompted) | Gemini |
| Latency (non-streaming) | ~6.2s | ~21.5s | Gemini |
| Native PDF/Image ingestion | Native via Vertex SDK | Native via Messages API | Gemini (Eco-system fit) |
| Base tone quality | 5.5/10 | 9.5/10 | Claude |
| Tone quality (+ constraints) | 9.2/10 | 9.5/10 | Near tie |
| Cost per 1M input tokens | $0.075 | $3.00 | Gemini |
Gemini won on five of six dimensions. Claude won on one — base tone — and that gap was closeable through prompt engineering.
If you're building something similar to Ozigi, these are the constraints worth looking through before you pick an API and start building:
**1. Does your UI depend on structured output? If your frontend calls JSON.parse() on a raw model response, you need API-level schema enforcement — not prompt instructions asking nicely. responseSchema via Vertex AI, Claude's tool use with forced tool_choice, or structured outputs via OpenAI all enforce at the decoding layer. The question isn't which model supports it — most do — it's which enforcement path fits your existing stack.
2. Do you have a free tier or public sandbox? If yes, latency and cost are product decisions that affect conversion, not just infrastructure decisions that affect margins.
**3. Does your use case require multimodal inputs? Most major models now support native PDF and image ingestion without external preprocessing. Map out what the integration looks like within your existing API provider before assuming you need to switch or add infrastructure
4. Where is the base model weakest, and is that gap engineerable? Claude's tone advantage is real. It's also not the only path to human-sounding copy. Engineering constraints at the prompt level can close gaps that feel insurmountable when you're just looking at base benchmarks.
The best model for your product is rarely the one with the highest aggregate score. It's the one that fails least on the constraints you actually can't work around.