
LLM Structured Output in 2026: Stop Parsing JSON with Regex and Do It Right

You've been there. You ask GPT to "return a JSON object with the user's name, email, and sentiment score." It returns perfectly formatted JSON... wrapped in a markdown code block. With a helpful explanation. And a disclaimer about how it's an AI.

So you write a regex to strip the code fences. Then another regex for the trailing commentary. Then it randomly returns JSONL instead of JSON. Then it wraps everything in {"result": ...} when you didn't ask for that. Then it works perfectly for 10,000 requests and fails catastrophically on request 10,001 because the user's name contained a quote character.
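For the record, that hand-rolled workaround usually looks something like this. A stdlib-only sketch; the fence-stripping regex and the `fragile_parse` name are illustrative, and the commented-out call shows the kind of input that still defeats it:

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker

def fragile_parse(raw: str) -> dict:
    # Strip the markdown code fences the model "helpfully" added.
    raw = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", raw.strip())
    # Hope whatever is left is JSON. (It often isn't.)
    return json.loads(raw)

wrapped = FENCE + 'json\n{"name": "Ada"}\n' + FENCE
print(fragile_parse(wrapped))  # {'name': 'Ada'}
# fragile_parse('Sure! Here is the JSON: {"name": "Ada"}')  # raises JSONDecodeError
```

Every new special case in that regex handles yesterday's failure and does nothing for tomorrow's.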

This is the structured output problem, and in 2026, you should not be solving it by hand anymore.

Every major LLM provider now offers native structured output. The tooling (Pydantic for Python, Zod for TypeScript) has matured enormously. And yet, most developers are still either parsing raw strings or using function calling as a hacky workaround.

This guide covers everything: how structured output actually works under the hood, how to implement it across OpenAI, Anthropic, and Gemini, the Python and TypeScript ecosystems, and, most importantly, the production pitfalls that will bite you if you don't know about them.


Why Structured Output Matters (More Than You Think)

Here's the fundamental problem with LLMs in production:

LLMs are text generators.
Your application needs data structures.
The gap between these two things is where bugs live.

When you JSON.parse() a raw LLM response, you're making several dangerous assumptions:

  1. The output is valid JSON (it might not be)
  2. The JSON has the fields you expect (it might not)
  3. The field types are correct (strings vs numbers vs booleans)
  4. The values are within expected ranges (sentiment: -1 to 1, not "positive")
  5. The response doesn't contain extra fields you didn't ask for
  6. The response format is consistent across different inputs

Structured output eliminates all six of these problems by constraining the model's output at the token generation level, not after the fact.
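To see why hand-rolled checks pile up, here's a stdlib-only sketch (the `raw_response` payload is invented) covering just two of the six assumptions:

```python
import json

# An invented raw model response: it parses, but the score is a string
# and there's an extra field we never asked for.
raw_response = '{"name": "Ada", "email": "[email protected]", "score": "0.85", "note": "hi"}'

data = json.loads(raw_response)  # assumption 1 happens to hold this time

errors = []
if not isinstance(data.get("score"), (int, float)):
    errors.append("score is not a number")    # assumption 3 violated
if set(data) - {"name", "email", "score"}:
    errors.append("unexpected extra fields")  # assumption 5 violated

print(errors)  # ['score is not a number', 'unexpected extra fields']
```

Multiply this by every field, every range, and every consumer, and you have a validation library you never wanted to maintain.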

The Three Levels of Output Control

Level 1: Prompt Engineering (Unreliable)
  "Return JSON with fields: name, email, score"
  → Works 80-95% of the time
  → Fails silently on edge cases
  → No type guarantees

Level 2: Function Calling / Tool Use (Better)
  Define a function schema, model "calls" it
  → Works 95-99% of the time
  → Schema is a hint, not a constraint
  → Can still produce invalid values within valid types

Level 3: Native Structured Output (Best)
  Constrained decoding with JSON Schema
  → Works 100% of the time (schema-valid guaranteed)
  → Uses finite state machines to mask invalid tokens
  → Types AND values are enforced at generation time

In 2026, you should be at Level 3 for anything going to production.


How Structured Output Actually Works

Most developers treat structured output as a black box: "I give it a schema, it returns valid JSON." But understanding the mechanism matters for debugging and optimization.

Constrained Decoding (The Magic Behind the Curtain)

When an LLM generates text, it predicts the next token from a vocabulary of 100,000+ tokens. Normally, any token can follow any other token. Structured output adds a constraint layer:

Normal generation:
  Token probabilities: {"hello": 0.3, "{": 0.1, "The": 0.2, ...}
  → Any token can be selected

Constrained generation (expecting JSON object start):
  Token probabilities: {"hello": 0.3, "{": 0.1, "The": 0.2, ...}
  Mask: {"hello": 0, "{": 1, "The": 0, ...}
  → Only "{" and whitespace tokens remain valid
  → Model MUST output "{"

This is implemented using a Finite State Machine (FSM) that tracks where you are in the JSON schema:

State Machine for {"name": string, "age": integer}:

START → expect "{"
  → expect "\"name\""
    → expect ":"
      → expect string value
        → expect "," or "}"
          → if ",": expect "\"age\""
            → expect ":"
              → expect integer value
                → expect "}"
                  → DONE

At each state, the FSM masks out all tokens that would violate the schema. The model can still choose the most likely valid token, preserving quality while guaranteeing structure.
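A toy version of that masking step, with a five-token vocabulary and a hand-written allowed set (a real implementation compiles the JSON Schema into an FSM over the full tokenizer vocabulary):

```python
# One decoding step: the FSM state says a JSON object must start,
# so every token except "{" is masked out before sampling.
probs = {"hello": 0.30, "The": 0.20, " ": 0.35, "{": 0.10, "}": 0.05}
allowed = {"{"}  # tokens the current FSM state permits

masked = {tok: p for tok, p in probs.items() if tok in allowed}
total = sum(masked.values())
renormalized = {tok: p / total for tok, p in masked.items()}

next_token = max(renormalized, key=renormalized.get)
print(next_token)  # "{" is forced, regardless of its raw probability
```

When several tokens survive the mask (say, different string values), the model still picks the most likely one, which is why output quality is preserved.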

Why This Is Better Than Prompt Engineering

Prompt: "Return a JSON object with 'score' as a number between 0 and 1"

Without constrained decoding:
  Model might output: {"score": "0.85"}     ← string, not number
  Model might output: {"score": 0.85, "confidence": "high"}  ← extra field
  Model might output: {"score": 85}         ← wrong range
  Model might output: Sure! Here's the JSON: {"score": 0.85}  ← preamble

With constrained decoding:
  Model MUST output: {"score": 0.85}        ← always valid

Implementation: OpenAI

OpenAI's structured output is the most mature. It's available in the Chat Completions API with response_format.

Basic Usage

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class SentimentAnalysis(BaseModel):
    sentiment: str  # "positive", "negative", "neutral"
    confidence: float
    key_phrases: list[str]
    reasoning: str

response = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "Analyze the sentiment of the given text."},
        {"role": "user", "content": "This product is absolutely terrible. Worst purchase ever."},
    ],
    response_format=SentimentAnalysis,
)

result = response.choices[0].message.parsed
print(result.sentiment)    # "negative"
print(result.confidence)   # 0.95
print(result.key_phrases)  # ["absolutely terrible", "worst purchase ever"]
```

With Enums and Nested Objects

```python
from enum import Enum
from pydantic import BaseModel, Field

class Sentiment(str, Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"
    mixed = "mixed"

class Entity(BaseModel):
    name: str
    type: str = Field(description="person, organization, product, or location")
    sentiment: Sentiment

class FullAnalysis(BaseModel):
    overall_sentiment: Sentiment
    confidence: float = Field(ge=0.0, le=1.0)
    entities: list[Entity]
    summary: str = Field(max_length=200)
    topics: list[str] = Field(min_length=1, max_length=5)

response = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "Extract structured analysis from the text."},
        {"role": "user", "content": "Apple's new MacBook Pro is incredible, but Tim Cook's keynote was boring."},
    ],
    response_format=FullAnalysis,
)

result = response.choices[0].message.parsed
# result.entities = [
#     Entity(name="Apple", type="organization", sentiment="positive"),
#     Entity(name="MacBook Pro", type="product", sentiment="positive"),
#     Entity(name="Tim Cook", type="person", sentiment="negative"),
# ]
```

TypeScript with Zod

```typescript
import OpenAI from 'openai';
import { z } from 'zod';
import { zodResponseFormat } from 'openai/helpers/zod';

const client = new OpenAI();

const SentimentSchema = z.object({
  sentiment: z.enum(['positive', 'negative', 'neutral', 'mixed']),
  confidence: z.number().min(0).max(1),
  entities: z.array(z.object({
    name: z.string(),
    type: z.enum(['person', 'organization', 'product', 'location']),
    sentiment: z.enum(['positive', 'negative', 'neutral']),
  })),
  summary: z.string(),
  topics: z.array(z.string()).min(1).max(5),
});

type Sentiment = z.infer<typeof SentimentSchema>;

const response = await client.beta.chat.completions.parse({
  model: 'gpt-5-mini',
  messages: [
    { role: 'system', content: 'Extract structured analysis from the text.' },
    { role: 'user', content: 'The new React compiler is amazing but the migration docs are lacking.' },
  ],
  response_format: zodResponseFormat(SentimentSchema, 'sentiment_analysis'),
});

const result: Sentiment = response.choices[0].message.parsed!;
console.log(result.sentiment); // "mixed"
```

Implementation: Anthropic (Claude)

Anthropic's approach to structured output uses tool use (function calling) as the mechanism. You define a tool with a JSON schema, and Claude returns structured output as if calling that tool.

Basic Usage

```python
import anthropic
from pydantic import BaseModel

client = anthropic.Anthropic()

class ExtractedData(BaseModel):
    name: str
    email: str
    company: str
    role: str
    urgency: str  # "low", "medium", "high", "critical"

response = client.messages.create(
    model="claude-sonnet-4-20260514",
    max_tokens=1024,
    tools=[{
        "name": "extract_contact",
        "description": "Extract contact information from the email.",
        "input_schema": ExtractedData.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "extract_contact"},
    messages=[{
        "role": "user",
        "content": """Extract the contact info from this email:

        Hi, I'm Sarah Chen from DataFlow Inc. Our production pipeline is down
        and we need immediate help. Please reach me at [email protected] --
        I'm the VP of Engineering.""",
    }],
)

# Extract the tool use result
tool_result = next(
    block for block in response.content if block.type == "tool_use"
)
data = ExtractedData(**tool_result.input)
print(data.name)     # "Sarah Chen"
print(data.urgency)  # "critical"
```

TypeScript with Zod + Anthropic

```typescript
import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

const client = new Anthropic();

const ContactSchema = z.object({
  name: z.string(),
  email: z.string().email(),
  company: z.string(),
  role: z.string(),
  urgency: z.enum(['low', 'medium', 'high', 'critical']),
});

const response = await client.messages.create({
  model: 'claude-sonnet-4-20260514',
  max_tokens: 1024,
  tools: [{
    name: 'extract_contact',
    description: 'Extract contact information from the email.',
    input_schema: zodToJsonSchema(ContactSchema) as Anthropic.Tool.InputSchema,
  }],
  tool_choice: { type: 'tool' as const, name: 'extract_contact' },
  messages: [{
    role: 'user',
    content: 'Extract info: Hi, I am John Park, CTO at Acme Corp ([email protected]). Not urgent.',
  }],
});

const toolBlock = response.content.find(
  (block): block is Anthropic.ToolUseBlock => block.type === 'tool_use'
);
const data = ContactSchema.parse(toolBlock!.input);
console.log(data.urgency); // "low"
```

Implementation: Google Gemini

Gemini supports structured output natively through its response_schema parameter. It uses constrained decoding similar to OpenAI.

Basic Usage (Python)

```python
import json
from enum import Enum

import google.generativeai as genai
from pydantic import BaseModel

class Priority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class TaskExtraction(BaseModel):
    title: str
    assignee: str
    priority: Priority
    deadline: str | None
    tags: list[str]

model = genai.GenerativeModel(
    "gemini-2.5-flash",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=TaskExtraction,
    ),
)

response = model.generate_content(
    "Extract the task: 'John needs to fix the login bug by Friday. "
    "It's blocking prod. Tag it as backend and auth.'"
)

result = TaskExtraction(**json.loads(response.text))
print(result.priority)  # "critical"
print(result.tags)      # ["backend", "auth"]
```

The Provider Comparison Table

Before choosing a provider for structured output, here's how they compare:

Feature              OpenAI           Anthropic        Gemini
───────────────────  ───────────────  ───────────────  ───────────────
Method               Native SO        Tool Use         Native SO
Constrained decode   Yes              Partial          Yes
100% schema valid    Yes              99%+             Yes
Streaming support    Yes              Yes              Yes
Pydantic native      Yes (.parse)     Manual schema    Manual schema
Zod native           Yes (helper)     Manual convert   Manual convert
Nested objects       Yes              Yes              Yes
Enums                Yes              Yes              Yes
Optional fields      Yes              Yes              Yes
Recursive schemas    Limited          Yes              Limited
Max schema depth     5 levels         No limit         No limit
Refusal handling     Yes              N/A              N/A

Recommendation: If you need guaranteed schema compliance, use OpenAI or Gemini's native structured output. If you're already on Claude and need structured data, the tool use pattern works well but add Pydantic/Zod validation as a safety net.


Production Patterns That Actually Work

Pattern 1: The Validation Sandwich

Never trust the LLM output directly, even with structured output. Always validate.

```python
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator

client = OpenAI()

class ProductReview(BaseModel):
    rating: int = Field(ge=1, le=5)
    title: str = Field(min_length=5, max_length=100)
    pros: list[str] = Field(min_length=1, max_length=5)
    cons: list[str] = Field(max_length=5)
    would_recommend: bool

    @field_validator('title')
    @classmethod
    def title_not_generic(cls, v: str) -> str:
        generic_titles = ['good', 'bad', 'ok', 'fine', 'great']
        if v.lower().strip() in generic_titles:
            raise ValueError(f'Title too generic: {v}')
        return v

def extract_review(text: str) -> ProductReview:
    response = client.beta.chat.completions.parse(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": "Extract a structured product review."},
            {"role": "user", "content": text},
        ],
        response_format=ProductReview,
    )
    if response.choices[0].message.refusal:
        raise ValueError(f"Model refused: {response.choices[0].message.refusal}")

    result = response.choices[0].message.parsed
    # Re-validate even though OpenAI guarantees schema compliance
    # (catches business logic violations that JSON Schema can't express)
    return ProductReview.model_validate(result.model_dump())
```

Pattern 2: Retry with Escalation

When structured output fails (rare but it happens), escalate gracefully:

```python
from pydantic import BaseModel
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
)
def extract_with_retry(text: str, schema: type[BaseModel]) -> BaseModel:
    try:
        response = client.beta.chat.completions.parse(
            model="gpt-5-mini",
            messages=[
                {"role": "system", "content": "Extract structured data precisely."},
                {"role": "user", "content": text},
            ],
            response_format=schema,
        )
        result = response.choices[0].message.parsed
        return schema.model_validate(result.model_dump())
    except Exception as e:
        print(f"Attempt failed: {e}")
        raise

# Usage
try:
    review = extract_with_retry(user_text, ProductReview)
except Exception:
    # Fallback: simpler schema or manual processing
    review = extract_with_retry(user_text, SimpleReview)
```

Pattern 3: Multi-Provider Fallback

Don't lock yourself into one provider. Build a fallback chain:

```typescript
import OpenAI from 'openai';
import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
import { zodResponseFormat } from 'openai/helpers/zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

const schema = z.object({
  intent: z.enum(['question', 'complaint', 'feedback', 'request']),
  urgency: z.enum(['low', 'medium', 'high']),
  summary: z.string().max(200),
  action_required: z.boolean(),
});

type TicketClassification = z.infer<typeof schema>;

async function classifyTicket(text: string): Promise<TicketClassification> {
  // Try OpenAI first (fastest structured output)
  try {
    const openai = new OpenAI();
    const response = await openai.beta.chat.completions.parse({
      model: 'gpt-5-mini',
      messages: [
        { role: 'system', content: 'Classify the support ticket.' },
        { role: 'user', content: text },
      ],
      response_format: zodResponseFormat(schema, 'ticket'),
    });
    return schema.parse(response.choices[0].message.parsed);
  } catch (openaiError) {
    console.warn('OpenAI failed, falling back to Claude:', openaiError);
  }

  // Fallback to Anthropic
  try {
    const anthropic = new Anthropic();
    const response = await anthropic.messages.create({
      model: 'claude-sonnet-4-20260514',
      max_tokens: 512,
      tools: [{
        name: 'classify',
        description: 'Classify the ticket.',
        input_schema: zodToJsonSchema(schema) as Anthropic.Tool.InputSchema,
      }],
      tool_choice: { type: 'tool' as const, name: 'classify' },
      messages: [{ role: 'user', content: `Classify: ${text}` }],
    });
    const toolBlock = response.content.find(
      (b): b is Anthropic.ToolUseBlock => b.type === 'tool_use'
    );
    return schema.parse(toolBlock!.input);
  } catch (anthropicError) {
    console.error('Both providers failed:', anthropicError);
    throw new Error('All LLM providers failed for structured output');
  }
}
```

Pattern 4: Streaming Structured Output

For long-form structured responses, stream partial results:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Article(BaseModel):
    title: str
    sections: list[dict]  # {"heading": str, "content": str}
    tags: list[str]
    word_count: int

# Streaming with structured output
with client.beta.chat.completions.stream(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "Generate an article outline with detailed sections."},
        {"role": "user", "content": "Write about WebAssembly in 2026."},
    ],
    response_format=Article,
) as stream:
    for event in stream:
        # Get partial JSON as it's generated
        snapshot = event.snapshot
        if snapshot and snapshot.choices[0].message.content:
            partial = snapshot.choices[0].message.content
            print(f"Receiving: {len(partial)} chars...")

    # Final parsed result
    final = stream.get_final_completion()

article = final.choices[0].message.parsed
print(f"Article: {article.title} ({article.word_count} words)")
```

The Pitfalls Nobody Talks About

Pitfall 1: The Schema Complexity Tax

Every constraint you add to your schema increases latency. Complex schemas with deeply nested objects, many enums, and strict validation can double or triple your response time.

Schema complexity vs. latency (gpt-5-mini, average):

Schema                        Tokens/s    First Token    Total Time
────────────────────────────  ──────────  ─────────────  ──────────
No schema (free text)         85 tok/s    ~200ms         ~500ms
Simple (3 fields)             78 tok/s    ~250ms         ~550ms
Medium (10 fields, 1 enum)    65 tok/s    ~350ms         ~800ms
Complex (20+ fields, nested)  45 tok/s    ~500ms         ~1.5s
Very complex (recursive)      30 tok/s    ~800ms         ~3s

Solution: Break complex schemas into multiple smaller calls. Instead of one mega-schema, use a pipeline:

# โŒ One giant schema class FullDocumentAnalysis(BaseModel): entities: list[Entity] # 20+ fields each sentiment: SentimentDetail # 10+ fields summary: Summary # 5 fields classification: Classification # 8 fields # ... 50+ total fields # โœ… Pipeline of smaller schemas class Step1_Entities(BaseModel): entities: list[SimpleEntity] # 5 fields each class Step2_Sentiment(BaseModel): overall: str confidence: float aspects: list[str] class Step3_Classification(BaseModel): category: str subcategory: str priority: str # Run in parallel import asyncio entities, sentiment, classification = await asyncio.gather( extract(text, Step1_Entities), extract(text, Step2_Sentiment), extract(text, Step3_Classification), )

Pitfall 2: Schema Versioning Hell

Your application evolves. Your schema evolves. But the LLM doesn't know that you renamed user_name to name last Tuesday.

```python
# Version 1 (deployed January)
class UserProfile_v1(BaseModel):
    user_name: str
    email_address: str
    age: int

# Version 2 (deployed February)
class UserProfile_v2(BaseModel):
    name: str   # renamed!
    email: str  # renamed!
    age: int
    location: str | None  # new field

# Problem: Old cached prompts still reference v1 field names.
# Problem: Downstream consumers expect v1 format.
# Problem: A/B tests run both versions simultaneously.
```

Solution: Use explicit schema versioning and migration:

```python
from typing import Literal

from pydantic import BaseModel, Field

class UserProfile(BaseModel):
    schema_version: Literal["2.0"] = "2.0"
    name: str = Field(alias="user_name")       # Accept old field name
    email: str = Field(alias="email_address")
    age: int
    location: str | None = None

    class Config:
        populate_by_name = True  # Accept both alias and field name
```

Pitfall 3: The Empty Array Trap

LLMs struggle with returning empty arrays when there's genuinely nothing to extract. They'll often hallucinate entries to "fill" the array.

```python
# Input:    "The weather is nice today."
# Expected: {"entities": [], "topics": ["weather"]}
# Actual:   {"entities": [{"name": "weather", "type": "concept"}], "topics": ["weather"]}

# The model HATES returning empty arrays.
# Solution: make empty arrays explicitly valid and prompt for them.
class Extraction(BaseModel):
    entities: list[Entity] = Field(
        default_factory=list,
        description="Named entities found in the text. Return empty list [] if none found.",
    )
```

Pitfall 4: Enum Confusion with Similar Values

```python
class Priority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"
    urgent = "urgent"  # ← How is this different from "critical"?

# The model will inconsistently choose between "critical" and "urgent"
# because the distinction isn't clear to the model either.

# Solution: use fewer, clearly distinct enum values with descriptions.
class Priority(str, Enum):
    low = "low"            # Can wait days/weeks
    medium = "medium"      # Should be handled this sprint
    high = "high"          # Needs attention today
    critical = "critical"  # Production is down, fix NOW
```

Pitfall 5: Token Limits and Truncation

Structured output doesn't bypass token limits. If your schema requires a summary field with max_length=500 but the model hits max_tokens before completing the JSON, you get:

{"title": "Analysis", "summary": "The product shows excellent performance in

That's invalid JSON. The response is cut off.

Solution: Always set max_tokens significantly higher than your expected output, and handle the finish_reason:

```python
response = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[...],
    response_format=MySchema,
    max_tokens=4096,  # Be generous
)

if response.choices[0].finish_reason == "length":
    # Response was truncated! Retry with higher max_tokens or a simpler schema.
    raise ValueError("Response truncated: increase max_tokens or simplify schema")
```
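A cheap extra guard, stdlib only, is to attempt a parse before handing any payload downstream; `is_complete_json` is an illustrative helper name, and the truncated string is the cut-off example above:

```python
import json

def is_complete_json(text: str) -> bool:
    """Return True only if text parses as a complete JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

truncated = '{"title": "Analysis", "summary": "The product shows excellent performance in'
print(is_complete_json(truncated))          # False
print(is_complete_json('{"title": "ok"}'))  # True
```

This belongs in the retry path: a failed parse plus finish_reason "length" tells you to raise max_tokens rather than to blame the schema.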

Pitfall 6: The Refusal Trap (OpenAI Specific)

OpenAI's structured output can refuse to generate content if the input triggers safety filters. When this happens, message.parsed is None and message.refusal contains the reason.

```python
response = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[
        {"role": "user", "content": "Analyze this customer complaint: [potentially sensitive content]"},
    ],
    response_format=Analysis,
)

parsed = response.choices[0].message.parsed
refusal = response.choices[0].message.refusal

if refusal:
    # Don't crash! Handle gracefully.
    print(f"Model refused: {refusal}")
    # Fallback: use a different model, rephrase, or flag for human review
elif parsed:
    process(parsed)
```

Pydantic vs Zod: The Definitive Comparison

If you're choosing between Python and TypeScript for your LLM pipeline, here's how the validation libraries compare:

Feature                  Pydantic (Python)     Zod (TypeScript)
───────────────────────  ────────────────────  ────────────────────
Type inference           From annotations      From .infer<>
Runtime validation       Built-in              Built-in
JSON Schema export       .model_json_schema()  zodToJsonSchema()
Default values           Field(default=...)    .default(value)
Custom validators        @field_validator      .refine() / .transform()
Nested objects           Native                Native
Discriminated unions     Supported             .discriminatedUnion()
Recursive schemas        Supported             z.lazy()
Serialization            .model_dump()         N/A (plain objects)
ORM integration          Yes (SQLAlchemy)      Drizzle/Prisma
Community size           Massive               Massive
OpenAI native support    Yes (.parse)          Yes (zodResponseFormat)
Anthropic integration    .model_json_schema()  zodToJsonSchema()

When to Use Pydantic

```python
# Pydantic shines for complex data pipelines:
from pydantic import BaseModel, Field, model_validator

class Invoice(BaseModel):
    items: list[LineItem]
    subtotal: float
    tax_rate: float = Field(ge=0, le=0.5)
    total: float

    @model_validator(mode='after')
    def validate_total(self) -> 'Invoice':
        expected = self.subtotal * (1 + self.tax_rate)
        if abs(self.total - expected) > 0.01:
            raise ValueError(
                f'Total {self.total} does not match '
                f'subtotal {self.subtotal} × (1 + {self.tax_rate}) = {expected}'
            )
        return self
```

When to Use Zod

```typescript
// Zod shines for API validation and type-safe pipelines:
const InvoiceSchema = z.object({
  items: z.array(LineItemSchema),
  subtotal: z.number().positive(),
  taxRate: z.number().min(0).max(0.5),
  total: z.number().positive(),
}).refine(
  (data) => Math.abs(data.total - data.subtotal * (1 + data.taxRate)) < 0.01,
  { message: 'Total does not match subtotal × (1 + taxRate)' }
);

// The type is inferred automatically; no separate interface needed
type Invoice = z.infer<typeof InvoiceSchema>;
```

Advanced Pattern: Schema Composition for Complex Workflows

Real-world applications rarely need a single schema. Here's how to compose schemas for a multi-step extraction pipeline:

```python
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

# Step 1: Quick classification (fast, cheap model)
class TicketType(str, Enum):
    bug = "bug"
    feature = "feature"
    question = "question"
    billing = "billing"

class QuickClassification(BaseModel):
    type: TicketType
    language: str = Field(description="Programming language if applicable, else 'N/A'")
    needs_human: bool

# Step 2: Detailed extraction (only for bugs, use smarter model)
class BugReport(BaseModel):
    title: str = Field(max_length=100)
    steps_to_reproduce: list[str] = Field(min_length=1)
    expected_behavior: str
    actual_behavior: str
    environment: dict[str, str]  # {"os": "...", "browser": "...", etc.}
    severity: str = Field(description="minor, major, or critical")

# Step 3: Auto-routing
class RoutingDecision(BaseModel):
    team: str = Field(description="backend, frontend, infra, or billing")
    priority: int = Field(ge=1, le=5)
    suggested_assignee: str | None
    auto_reply: str = Field(max_length=500)

async def process_ticket(text: str):
    # Step 1: Classify (cheap, fast)
    classification = await extract(text, QuickClassification, model="gpt-5-mini")

    if classification.needs_human:
        return route_to_human(text)

    # Step 2: Extract details (only if bug)
    details = None
    if classification.type == TicketType.bug:
        details = await extract(text, BugReport, model="gpt-5")

    # Step 3: Route
    context = f"Type: {classification.type}. "
    if details:
        context += f"Severity: {details.severity}. Steps: {details.steps_to_reproduce}"
    routing = await extract(context, RoutingDecision, model="gpt-5-mini")

    return {
        "classification": classification,
        "details": details,
        "routing": routing,
    }
```

Cost Optimization: Structured Output Isn't Free

Structured output adds overhead. Here's what it costs:

Cost factors for structured output:

1. Schema tokens: The JSON schema is included in the system prompt.
   Simple schema (3 fields):  ~50 tokens  ($0.00001)
   Complex schema (20 fields): ~500 tokens ($0.0001)
   Very complex (nested):      ~2000 tokens ($0.0004)

2. Output tokens: Structured output generates more tokens than free text.
   "The sentiment is positive" = 5 tokens
   {"sentiment": "positive"}  = 7 tokens (~40% more)
   Full structured response = 2-3x the tokens of a free text summary

3. Latency overhead: Constrained decoding adds ~10-30% latency.

Monthly cost impact (1M requests/day):
──────────────────────────────────────────────────────
Approach               Tokens/req  Cost/month  Latency
Free text response     50          $1,500      200ms
Simple structured      70          $2,100      250ms
Complex structured     200         $6,000      400ms

Savings strategy:
  → Use structured output ONLY where you need it
  → Use free text for summaries, structured for data extraction
  → Cache responses aggressively (same input = same output)
  → Use gpt-5-mini for classification, gpt-5 for complex extraction
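The monthly numbers in that table follow from a simple formula; here it is as code (the $1 per million output tokens rate is an assumption chosen to reproduce the table; substitute your provider's actual pricing):

```python
def monthly_cost(tokens_per_request: int, requests_per_day: int,
                 price_per_million_tokens: float) -> float:
    """Rough monthly spend on output tokens alone (30-day month)."""
    tokens_per_month = tokens_per_request * requests_per_day * 30
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# 1M requests/day at an assumed $1 per million output tokens:
print(monthly_cost(50, 1_000_000, 1.0))   # 1500.0 -> free text
print(monthly_cost(70, 1_000_000, 1.0))   # 2100.0 -> simple structured
print(monthly_cost(200, 1_000_000, 1.0))  # 6000.0 -> complex structured
```

Run it against your own token counts before committing to a complex schema; the difference compounds quickly at high request volume.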

The Decision Framework

Not everything needs structured output. Here's when to use it:

Use Structured Output when:
  ✅ Output feeds directly into code (API responses, database inserts)
  ✅ You need type guarantees (numbers must be numbers, not strings)
  ✅ Multiple downstream consumers depend on consistent format
  ✅ You're building automated pipelines (no human in the loop)
  ✅ Data extraction from unstructured text (emails, documents, logs)

Don't use Structured Output when:
  ❌ Output is shown directly to users (chat, content generation)
  ❌ You need creative, free-form responses
  ❌ The schema would be more complex than the task
  ❌ You're prototyping and the schema is changing daily
  ❌ Cost is a major concern and free text works fine

What's Coming Next

2026 Q1โ€“Q2 (Now)

  • ✅ OpenAI structured output GA with streaming
  • ✅ Anthropic tool use stable across Claude Sonnet/Opus
  • ✅ Gemini 2.5 native JSON mode with schema enforcement
  • 🔄 Pydantic v3 beta with native LLM integration hooks
  • 🔄 Zod v4 with improved JSON Schema compatibility

2026 Q3โ€“Q4

  • Cross-provider schema portability (one schema, any LLM)
  • Streaming partial objects with field-level callbacks
  • Schema auto-generation from TypeScript interfaces (no Zod needed)
  • Constrained decoding for images and audio (multimodal structured output)

2027 and Beyond

  • Structured output becomes the default (free text becomes opt-in)
  • LLMs that can negotiate schema changes at runtime
  • Embedded validation directly in model weights (no FSM needed)

Conclusion

Structured output in 2026 is no longer optional for production LLM applications. The days of regex-parsing GPT responses and praying are over.

The key takeaways:

  1. Use native structured output (OpenAI's .parse(), Gemini's response_schema). Don't roll your own JSON parser.
  2. Always validate with Pydantic or Zod, even when the provider guarantees schema compliance. Business logic validation catches what JSON Schema can't.
  3. Watch the cost. Complex schemas are expensive. Break them into smaller, parallelized calls.
  4. Handle edge cases: refusals, truncation, empty arrays, and enum confusion will bite you in production.
  5. Build fallback chains. No single provider is 100% reliable. Use multi-provider patterns for critical paths.

The real question isn't "should I use structured output?" It's "why are you still parsing free text with regex in 2026?"

Tags: LLM, Structured Output, OpenAI, Anthropic, Gemini, Pydantic, Zod, AI Engineering, TypeScript, Python
