
How to Cut Your LLM API Costs by 90%: Caching, Routing, and Prompt Engineering That Actually Work

Your LLM API bill just hit $2,400 this month. Last month it was $800. The month before, $300. Your product is growing, your users love the AI features, and your finance team is starting to ask uncomfortable questions.

Sound familiar? You're not alone. In 2026, the average AI-powered SaaS product spends 30-50% of its infrastructure budget on LLM API calls. And most of that spending is completely unnecessary.

Here's the uncomfortable truth: most applications send redundant queries, use expensive models for trivial tasks, and transmit far more tokens than needed. The fix isn't to use AI less; it's to use it smarter.

This guide covers five battle-tested strategies that can cut your LLM API costs by 70-90% without degrading quality. These aren't theoretical tricks. Every technique here comes with production-ready code, real cost math, and honest trade-off analysis.

Understanding the Cost Landscape First

Before optimizing anything, you need to understand how you're actually being charged. Most developers have a vague sense that "tokens cost money" without understanding the full picture.

The Token Economics

Every major LLM provider charges per token, but the math isn't straightforward:

2026 Pricing Snapshot (per 1M tokens):

Model              |  Input  |  Output
-------------------+---------+---------
GPT-4.1            |  $1.00  |   $4.00
GPT-4.1 mini       |  $0.20  |   $0.80
GPT-4.1 nano       |  $0.05  |   $0.20
Claude Opus 4.6    |  $5.00  |  $25.00
Claude Sonnet 4.6  |  $3.00  |  $15.00
Claude Haiku 4.5   |  $1.00  |   $5.00
Gemini 2.5 Pro     |  $1.25  |  $10.00
Gemini 2.5 Flash   |  $0.30  |   $2.50
DeepSeek V4        |  $0.30  |   $0.50

Two things jump out immediately:

  1. Output tokens cost several times more than input tokens (4-8x for most models in the table). A 200-token prompt that generates a 2,000-token response costs far more than you'd expect just from the input.
  2. The price spread between tiers is enormous. Claude Opus 4.6 is 125x more expensive per output token than GPT-4.1 nano. For many tasks, the cheaper model works just as well.
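To make the first point concrete, here's a small cost helper (a sketch; the hardcoded prices mirror the table above, and the token counts would come from your tokenizer):

```typescript
// Price table (per 1M tokens), mirroring the snapshot above
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4.1': { input: 1.0, output: 4.0 },
  'gpt-4.1-nano': { input: 0.05, output: 0.2 },
  'claude-sonnet-4.6': { input: 3.0, output: 15.0 },
};

// Dollar cost of one call, given token counts
function callCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICES[model];
  if (!p) throw new Error(`Unknown model: ${model}`);
  return (
    (inputTokens / 1_000_000) * p.input +
    (outputTokens / 1_000_000) * p.output
  );
}

// A 200-token prompt with a 2,000-token response on GPT-4.1:
// input costs $0.0002, output costs $0.008 -- the response
// is ~97% of the bill even though the prompt looks tiny.
const cost = callCost('gpt-4.1', 200, 2000);
```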

Where the Money Actually Goes

In a typical AI-powered application, costs break down like this:

Typical LLM Cost Distribution:

  Repeated/similar queries:            35-45%  ← Cacheable
  Simple tasks on premium models:      20-30%  ← Routable to cheaper models
  Bloated prompts:                     15-20%  ← Compressible
  Unavoidable unique complex queries:  10-20%  ← Optimize output length

That means 60-80% of your LLM spending is reducible without touching your product quality. Let's go through each strategy.

Strategy 1: Semantic Caching

This is the highest-ROI optimization you can make. If your application handles any repeating patterns (customer support, content generation, code assistance, search), you're probably making identical or near-identical API calls thousands of times.

The Problem with Naive Caching

Traditional caching uses exact string matching. But LLM queries are almost never identical. Consider these three user inputs:

"How do I reset my password?"
"How can I change my password?"
"I forgot my password, how to reset it?"

These are semantically identical: they should all return the same response. But an exact-match cache treats them as three separate, billable API calls.

How Semantic Caching Works

Semantic caching converts queries into embedding vectors, then uses cosine similarity to find cached responses that are "close enough" in meaning:

```typescript
import { OpenAI } from 'openai';
import { createClient } from 'redis';
import { Index } from '@upstash/vector';

const openai = new OpenAI();
const redis = createClient();
const vectorIndex = new Index({
  url: process.env.UPSTASH_VECTOR_URL!,
  token: process.env.UPSTASH_VECTOR_TOKEN!,
});

interface CachedResponse {
  response: string;
  model: string;
  timestamp: number;
  hitCount: number;
}

const SIMILARITY_THRESHOLD = 0.92; // Tune this carefully
const CACHE_TTL = 86400; // 24 hours

async function queryWithSemanticCache(
  prompt: string,
  systemPrompt: string,
  model: string = 'gpt-4.1-mini'
): Promise<{ response: string; cached: boolean; savings: number }> {
  // 1. Generate embedding for the incoming query
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: `${systemPrompt}\n${prompt}`,
  });
  const vector = embedding.data[0].embedding;

  // 2. Search for semantically similar cached queries
  const results = await vectorIndex.query({
    vector,
    topK: 1,
    includeMetadata: true,
  });

  // 3. Check if we have a close enough match
  if (results.length > 0 && results[0].score >= SIMILARITY_THRESHOLD) {
    const cacheKey = results[0].id;
    const cached = await redis.get(`llm:${cacheKey}`);
    if (cached) {
      const parsed: CachedResponse = JSON.parse(cached);
      // Update hit count for analytics
      parsed.hitCount++;
      await redis.set(`llm:${cacheKey}`, JSON.stringify(parsed), {
        EX: CACHE_TTL,
      });
      return {
        response: parsed.response,
        cached: true,
        // estimateCost: your own pricing helper
        savings: estimateCost(prompt, parsed.response, model),
      };
    }
  }

  // 4. Cache miss -- make the actual API call
  const completion = await openai.chat.completions.create({
    model,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: prompt },
    ],
  });
  const response = completion.choices[0].message.content!;

  // 5. Store in both vector index and Redis
  const cacheId = crypto.randomUUID(); // global Web Crypto (Node 19+)
  await vectorIndex.upsert({
    id: cacheId,
    vector,
    metadata: { prompt: prompt.slice(0, 200) },
  });
  await redis.set(
    `llm:${cacheId}`,
    JSON.stringify({
      response,
      model,
      timestamp: Date.now(),
      hitCount: 0,
    } satisfies CachedResponse),
    { EX: CACHE_TTL }
  );

  return { response, cached: false, savings: 0 };
}
```

The Similarity Threshold is Everything

The SIMILARITY_THRESHOLD is the most critical tuning parameter. Get it wrong in either direction and the cache fails:

  • Too high (>0.97): Almost nothing matches. Cache hit rate near zero. You've built an expensive no-op.
  • Too low (<0.85): Unrelated queries return cached responses. Users get wrong answers. Your product breaks.

The sweet spot depends on your use case:

Use Case               | Recommended Threshold | Why
-----------------------+-----------------------+-----------------------------------------------------
FAQ / Customer Support | 0.90 - 0.93           | Questions cluster tightly around topics
Code Generation        | 0.95 - 0.97           | Small prompt differences lead to very different code
Content Summarization  | 0.88 - 0.92           | Same document → same summary regardless of phrasing
Creative Writing       | Don't cache           | Every response should be unique
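One way to keep these thresholds from drifting across call sites is a small per-use-case config (a sketch; the names are illustrative and the values are picked from the table's recommended ranges):

```typescript
// Use cases your app actually caches. Creative writing is
// deliberately absent: don't cache it.
type CacheUseCase = 'faq' | 'codegen' | 'summarization';

// Similarity thresholds per use case, from the table above
const THRESHOLDS: Record<CacheUseCase, number> = {
  faq: 0.92,
  codegen: 0.96,
  summarization: 0.90,
};

function thresholdFor(useCase: CacheUseCase): number {
  return THRESHOLDS[useCase];
}
```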

Real Cost Impact

Here's the math for a customer support chatbot processing 100,000 queries/month:

Without caching:
  100,000 queries × (~500 input + ~800 output tokens)
  = 50M input + 80M output tokens
  Using GPT-4.1 mini: (50 × $0.20) + (80 × $0.80) = $10 + $64 = $74/month

With semantic caching (65% hit rate):
  35,000 actual API calls + 65,000 cache hits
  Embedding costs: 100,000 × ~20 tokens × $0.02/1M = $0.04
  API calls: (17.5M × $0.20 + 28M × $0.80) / 1M = $25.90/month
  Vector DB: ~$10/month (Upstash, Pinecone, etc.)
  Redis: ~$5/month

  Total: ~$41/month → 45% savings

And that's with a conservative 65% hit rate. Applications with repetitive query patterns (support bots, docs search) regularly hit 80%+ cache rates, pushing savings above 70%.
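To see where caching breaks even for your own workload, the math above generalizes to a small estimator (a sketch; `perCallCost` and `fixedInfra` stand in for the per-query API cost and the vector DB + Redis line items):

```typescript
// Monthly cost of a workload behind a semantic cache.
// queries: total queries per month
// hitRate: fraction served from cache (0..1)
// perCallCost: dollar cost of one uncached API call
// embedCost: embedding cost per query (tiny, but nonzero)
// fixedInfra: vector DB + Redis, dollars per month
function cachedMonthlyCost(
  queries: number,
  hitRate: number,
  perCallCost: number,
  embedCost: number,
  fixedInfra: number
): number {
  const misses = queries * (1 - hitRate);
  return misses * perCallCost + queries * embedCost + fixedInfra;
}

// The support-bot example: ~$0.00074/call on GPT-4.1 mini
// (~500 in / ~800 out tokens), 100k queries/month
const baseline = 100_000 * 0.00074; // ~$74/month uncached
const withCache = cachedMonthlyCost(100_000, 0.65, 0.00074, 0.0000004, 15);
// ~$41/month at a 65% hit rate
```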

Strategy 2: Intelligent Model Routing

This is the optimization most teams skip because it feels like complexity for its own sake. But the math is compelling: you're probably using a $15/M-token model for tasks that a $0.05/M-token model handles just as well.

The Routing Architecture

The idea is simple: before sending a request to an expensive model, classify it and route it to the cheapest model that can handle it competently.

```typescript
type ComplexityLevel = 'trivial' | 'simple' | 'moderate' | 'complex';

interface RouteConfig {
  model: string;
  maxTokens: number;
  temperature: number;
  costPer1MInput: number;
  costPer1MOutput: number;
}

const MODEL_ROUTES: Record<ComplexityLevel, RouteConfig> = {
  trivial: {
    model: 'gpt-4.1-nano',
    maxTokens: 256,
    temperature: 0.1,
    costPer1MInput: 0.05,
    costPer1MOutput: 0.20,
  },
  simple: {
    model: 'gpt-4.1-mini',
    maxTokens: 1024,
    temperature: 0.3,
    costPer1MInput: 0.20,
    costPer1MOutput: 0.80,
  },
  moderate: {
    model: 'claude-haiku-4.5',
    maxTokens: 2048,
    temperature: 0.5,
    costPer1MInput: 1.00,
    costPer1MOutput: 5.00,
  },
  complex: {
    model: 'claude-sonnet-4.6',
    maxTokens: 4096,
    temperature: 0.7,
    costPer1MInput: 3.00,
    costPer1MOutput: 15.00,
  },
};

async function classifyComplexity(prompt: string): Promise<ComplexityLevel> {
  // Use the cheapest model to classify -- meta-optimization!
  const classification = await openai.chat.completions.create({
    model: 'gpt-4.1-nano',
    max_tokens: 10,
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: `Classify the user query complexity. Respond with ONLY one word:
- "trivial": greeting, yes/no question, simple lookup
- "simple": single-step task, basic Q&A, formatting
- "moderate": multi-step reasoning, comparison, analysis
- "complex": creative writing, code generation, deep research
Respond with the single classification word only.`,
      },
      { role: 'user', content: prompt },
    ],
  });

  const level = classification.choices[0].message.content!
    .trim()
    .toLowerCase() as ComplexityLevel;

  return MODEL_ROUTES[level] ? level : 'simple'; // Default to simple
}

async function routedQuery(
  prompt: string,
  systemPrompt: string
): Promise<{ response: string; model: string; cost: number }> {
  const complexity = await classifyComplexity(prompt);
  const route = MODEL_ROUTES[complexity];

  // callModel: your dispatch over provider SDKs for the routed model
  const response = await callModel(route.model, systemPrompt, prompt, {
    maxTokens: route.maxTokens,
    temperature: route.temperature,
  });

  const inputTokens = estimateTokens(systemPrompt + prompt);
  const outputTokens = estimateTokens(response);
  const cost =
    (inputTokens / 1_000_000) * route.costPer1MInput +
    (outputTokens / 1_000_000) * route.costPer1MOutput;

  return { response, model: route.model, cost };
}
```
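The snippet above leans on two helpers it doesn't define: `callModel` (a thin dispatch over your provider SDKs) and `estimateTokens`. For the latter, a common rough heuristic (an assumption, not an exact count) is about four characters per token for English text:

```typescript
// Rough token estimate: ~4 characters per token for English text.
// Good enough for cost accounting; swap in a real tokenizer
// (e.g. tiktoken) when exact counts matter.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```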

The Classification Cost Paradox

"Wait โ€” you're making an extra API call just to classify the query. Doesn't that cost more?"

Let's do the math. The classifier call uses GPT-4.1 nano with ~100 input tokens and ~5 output tokens:

Classification cost per query:
  (100 / 1M) × $0.05 + (5 / 1M) × $0.20 = $0.000006

Cost of sending everything to Claude Sonnet 4.6:
  (500 / 1M) × $3.00 + (800 / 1M) × $15.00 = $0.0135

Savings when routing 60% of queries to cheaper models:
  Without routing: $0.0135 per query
  With routing (40% Sonnet, 30% Haiku, 20% Mini, 10% Nano):
    = (0.4 × $0.0135) + (0.3 × $0.0045) + (0.2 × $0.00074) + (0.1 × $0.000185) + $0.000006
    = $0.0069 per query

  Savings: 49% per query

The classifier pays for itself 1000x over. And that's a conservative scenario: real-world routing often pushes 70%+ of queries to the cheapest tier.

Advanced: Quality-Aware Routing with Fallback

The smarter version doesn't just route โ€” it validates that cheaper models produced acceptable output:

```typescript
async function routedQueryWithFallback(
  prompt: string,
  systemPrompt: string,
  qualityThreshold: number = 0.7
): Promise<{ response: string; model: string; attempts: number }> {
  const complexity = await classifyComplexity(prompt);
  const route = MODEL_ROUTES[complexity];

  const response = await callModel(route.model, systemPrompt, prompt, {
    maxTokens: route.maxTokens,
  });

  // For non-trivial routes, verify quality with a cheap check
  // (quickQualityCheck: your own scorer, e.g. a rubric call to a cheap model)
  if (complexity !== 'complex') {
    const qualityScore = await quickQualityCheck(prompt, response);
    if (qualityScore < qualityThreshold) {
      // Escalate to next tier
      const nextTier = getNextTier(complexity);
      const betterResponse = await callModel(
        MODEL_ROUTES[nextTier].model,
        systemPrompt,
        prompt,
        { maxTokens: MODEL_ROUTES[nextTier].maxTokens }
      );
      return {
        response: betterResponse,
        model: MODEL_ROUTES[nextTier].model,
        attempts: 2,
      };
    }
  }

  return { response, model: route.model, attempts: 1 };
}
```
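The fallback snippet assumes two helpers: `quickQualityCheck` (your own scorer) and `getNextTier`. The escalation half is just an ordered walk up the complexity ladder; a minimal sketch:

```typescript
// Same union as the routing code uses
type ComplexityLevel = 'trivial' | 'simple' | 'moderate' | 'complex';

// Tiers in ascending capability/cost order
const TIER_ORDER: ComplexityLevel[] = [
  'trivial',
  'simple',
  'moderate',
  'complex',
];

// One step up the ladder; 'complex' is already the top, so it stays put.
function getNextTier(level: ComplexityLevel): ComplexityLevel {
  const i = TIER_ORDER.indexOf(level);
  return TIER_ORDER[Math.min(i + 1, TIER_ORDER.length - 1)];
}
```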

Strategy 3: Prompt Compression

Most prompts are bloated. System prompts especially tend to grow over time as developers add instructions, examples, and edge cases. A system prompt that started at 200 tokens can balloon to 2,000 tokens, and that cost multiplies with every single request.

Audit Your System Prompts

Here's a before/after of a real production system prompt:

Before (847 tokens):

You are a helpful customer support assistant for our e-commerce platform.
You should always be polite and professional in your responses.
When a customer asks about their order, you should look up the order
details and provide them with relevant information about the status
of their order, including the tracking number if available.
If you don't have access to the order information, please let the
customer know that you're unable to look up that information and
suggest they contact our support team directly.
Please make sure to always greet the customer warmly at the beginning
of your response and thank them for reaching out at the end.
You should never provide information about our internal systems,
pricing structures, or employee information.
Always respond in the same language that the customer uses.
If a customer is frustrated or angry, acknowledge their feelings
and try to empathize with their situation before providing a solution...
(continues for another 400 tokens)

After (189 tokens):

E-commerce support agent. Rules:
- Lookup order details, provide status + tracking if available
- If no access to order data, direct to support team
- Match customer's language
- Never reveal internal systems/pricing/employee info
- For frustrated customers: acknowledge → empathize → solve
- Tone: professional, warm, concise

Same behavior, 78% fewer tokens. At 100,000 queries/month with GPT-4.1 mini, that saves:

Token savings: 658 tokens × 100,000 = 65.8M tokens/month
Cost savings: 65.8M × $0.20/1M = $13.16/month just on input

$13/month sounds small, but it compounds. If you have 10 AI features, each with bloated prompts, that's $130/month, or $1,560/year, just from trimming system prompts.

Programmatic Prompt Compression

For dynamic prompts that include context (RAG, conversation history), compress before sending:

```typescript
function compressConversationHistory(
  messages: Array<{ role: string; content: string }>,
  maxTokens: number = 2000
): Array<{ role: string; content: string }> {
  const estimated = messages.reduce(
    (sum, m) => sum + estimateTokens(m.content),
    0
  );
  if (estimated <= maxTokens) return messages;

  // Strategy: keep the first message (context) and the last 3 messages;
  // summarize the middle.
  const first = messages[0];
  const last3 = messages.slice(-3);
  const middle = messages.slice(1, -3);
  if (middle.length === 0) return messages;

  const middleSummary = middle
    .map(m => `${m.role}: ${m.content.slice(0, 50)}...`)
    .join('\n');

  return [
    first,
    {
      role: 'system',
      content: `[Conversation summary: ${middleSummary}]`,
    },
    ...last3,
  ];
}
```

The Output Token Trap

Remember: output tokens cost several times more than input tokens. Setting max_tokens is the single easiest way to control costs:

```typescript
// ❌ Bad: Letting the model ramble
const response = await openai.chat.completions.create({
  model: 'gpt-4.1-mini',
  messages: [{ role: 'user', content: 'Summarize this article' }],
  // No max_tokens = model decides length
});

// ✅ Good: Constrain output length
const response = await openai.chat.completions.create({
  model: 'gpt-4.1-mini',
  messages: [
    { role: 'user', content: 'Summarize this article in 3 sentences.' },
  ],
  max_tokens: 200, // Hard limit as safety net
});
```

For structured outputs, use JSON mode with schemas; it produces consistently shorter, more predictable responses:

```typescript
const response = await openai.chat.completions.create({
  model: 'gpt-4.1-mini',
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'sentiment_analysis',
      schema: {
        type: 'object',
        properties: {
          sentiment: {
            type: 'string',
            enum: ['positive', 'negative', 'neutral'],
          },
          confidence: { type: 'number' },
          keywords: {
            type: 'array',
            items: { type: 'string' },
            maxItems: 5,
          },
        },
        required: ['sentiment', 'confidence', 'keywords'],
      },
    },
  },
  messages: [{ role: 'user', content: `Analyze: "${text}"` }],
  max_tokens: 100,
});
```

Strategy 4: Batch Processing

If your workload isn't time-sensitive, batch APIs offer steep discounts. OpenAI's Batch API charges 50% less than real-time calls. For background processing (nightly content generation, bulk classification, data enrichment), this is free money.

When to Use Batch Processing

Real-time required:
  ✗ User-facing chat
  ✗ Live code completion
  ✗ Interactive search

Batch-friendly:
  ✓ Nightly content summarization
  ✓ Bulk email classification
  ✓ Weekly report generation
  ✓ Data labeling and enrichment
  ✓ Content moderation backlog
  ✓ Translation batches

Implementation

```typescript
import { OpenAI } from 'openai';
import * as fs from 'fs';

const openai = new OpenAI();

interface BatchItem {
  custom_id: string;
  method: 'POST';
  url: '/v1/chat/completions';
  body: {
    model: string;
    messages: Array<{ role: string; content: string }>;
    max_tokens: number;
  };
}

async function submitBatchJob(
  items: Array<{ id: string; prompt: string }>,
  systemPrompt: string,
  model: string = 'gpt-4.1-mini'
): Promise<string> {
  // 1. Create JSONL file
  const batchLines: string[] = items.map(item => {
    const batchItem: BatchItem = {
      custom_id: item.id,
      method: 'POST',
      url: '/v1/chat/completions',
      body: {
        model,
        messages: [
          { role: 'system', content: systemPrompt },
          { role: 'user', content: item.prompt },
        ],
        max_tokens: 1024,
      },
    };
    return JSON.stringify(batchItem);
  });

  const filePath = `/tmp/batch-${Date.now()}.jsonl`;
  fs.writeFileSync(filePath, batchLines.join('\n'));

  // 2. Upload file
  const file = await openai.files.create({
    file: fs.createReadStream(filePath),
    purpose: 'batch',
  });

  // 3. Create batch
  const batch = await openai.batches.create({
    input_file_id: file.id,
    endpoint: '/v1/chat/completions',
    completion_window: '24h',
  });

  return batch.id; // Poll this for completion
}

async function checkBatchStatus(batchId: string) {
  const batch = await openai.batches.retrieve(batchId);

  if (batch.status === 'completed') {
    const file = await openai.files.content(batch.output_file_id!);
    const text = await file.text();
    const results = text
      .split('\n')
      .filter(Boolean)
      .map(line => JSON.parse(line));
    return results;
  }

  return {
    status: batch.status,
    progress: `${batch.request_counts.completed}/${batch.request_counts.total}`,
  };
}
```

Cost Impact

For a nightly job that processes 10,000 items:

Real-time API (GPT-4.1 mini):
  10,000 × (500 input + 800 output tokens)
  = (5M × $0.20 + 8M × $0.80) / 1M = $7.40

Batch API (50% discount):
  = $3.70

Monthly savings: ~$111 (assuming daily runs)
Annual savings: ~$1,332
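That arithmetic is easy to fold into a reusable helper for planning (a sketch; the 0.5 default reflects OpenAI's published 50% Batch API discount):

```typescript
// Real-time vs Batch API cost for a job of `items` requests,
// given average token counts and per-1M-token prices.
function batchJobCosts(
  items: number,
  inputTokens: number,
  outputTokens: number,
  inputPricePer1M: number,
  outputPricePer1M: number,
  batchDiscount = 0.5 // OpenAI Batch API: 50% off
): { realtime: number; batch: number } {
  const realtime =
    (items * inputTokens * inputPricePer1M +
      items * outputTokens * outputPricePer1M) /
    1_000_000;
  return { realtime, batch: realtime * (1 - batchDiscount) };
}

// The nightly job above: 10,000 items on GPT-4.1 mini
const { realtime, batch } = batchJobCosts(10_000, 500, 800, 0.2, 0.8);
// realtime ≈ $7.40, batch ≈ $3.70
```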

Strategy 5: LLM Gateway and Observability

You can't optimize what you can't measure. An LLM gateway sits between your application and the API providers, giving you:

  • Per-feature cost attribution: know which feature is burning money
  • Automatic retry and failover: if OpenAI is down, route to Anthropic
  • Rate limiting: prevent runaway costs from bugs or abuse
  • Usage analytics: spot optimization opportunities

Building a Lightweight Gateway

```typescript
interface LLMRequest {
  featureId: string;
  prompt: string;
  systemPrompt: string;
  model?: string;
  maxTokens?: number;
  userId?: string;
}

interface LLMMetrics {
  featureId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cost: number;
  latencyMs: number;
  cached: boolean;
  timestamp: number;
}

class LLMGateway {
  private metrics: LLMMetrics[] = [];
  private rateLimiter: Map<string, { count: number; resetAt: number }> =
    new Map();

  async query(request: LLMRequest): Promise<string> {
    // 1. Rate limiting
    this.enforceRateLimit(request.featureId);

    // 2. Budget check
    await this.checkBudget(request.featureId);

    const startTime = Date.now();

    // 3. Try semantic cache first
    const cached = await queryWithSemanticCache(
      request.prompt,
      request.systemPrompt,
      request.model
    );
    if (cached.cached) {
      this.recordMetrics({
        featureId: request.featureId,
        model: request.model || 'gpt-4.1-mini',
        inputTokens: 0,
        outputTokens: 0,
        cost: 0,
        latencyMs: Date.now() - startTime,
        cached: true,
        timestamp: Date.now(),
      });
      return cached.response;
    }

    // 4. Route to optimal model
    const result = await routedQuery(request.prompt, request.systemPrompt);

    // 5. Record metrics
    this.recordMetrics({
      featureId: request.featureId,
      model: result.model,
      inputTokens: estimateTokens(request.prompt + request.systemPrompt),
      outputTokens: estimateTokens(result.response),
      cost: result.cost,
      latencyMs: Date.now() - startTime,
      cached: false,
      timestamp: Date.now(),
    });

    return result.response;
  }

  getFeatureCosts(period: 'day' | 'week' | 'month'): Record<string, number> {
    const cutoff = {
      day: 86400000,
      week: 604800000,
      month: 2592000000,
    }[period];

    const recent = this.metrics.filter(
      m => Date.now() - m.timestamp < cutoff
    );

    return recent.reduce((acc, m) => {
      acc[m.featureId] = (acc[m.featureId] || 0) + m.cost;
      return acc;
    }, {} as Record<string, number>);
  }

  private enforceRateLimit(featureId: string) {
    const limit = this.rateLimiter.get(featureId);
    const now = Date.now();
    if (limit && now < limit.resetAt && limit.count >= 1000) {
      throw new Error(`Rate limit exceeded for feature: ${featureId}`);
    }
    if (!limit || now >= limit.resetAt) {
      this.rateLimiter.set(featureId, { count: 1, resetAt: now + 60000 });
    } else {
      limit.count++;
    }
  }

  private async checkBudget(featureId: string) {
    const monthlyCosts = this.getFeatureCosts('month');
    const featureBudget = await getFeatureBudget(featureId); // From your config
    if ((monthlyCosts[featureId] || 0) >= featureBudget) {
      throw new Error(
        `Monthly budget exceeded for ${featureId}: ` +
          `$${monthlyCosts[featureId]?.toFixed(2)} / $${featureBudget}`
      );
    }
  }

  private recordMetrics(metrics: LLMMetrics) {
    this.metrics.push(metrics);
    // In production: send to your analytics pipeline
  }
}
```

The Dashboard You Need

At minimum, track these per feature, per day:

LLM Cost Dashboard - March 2026

Feature       |  Calls  |   Cost  | Cache Hit
--------------+---------+---------+----------
Chat Support  |  45,231 |  $89.40 |    72%
Code Review   |  12,847 |  $67.20 |    34%
Summarizer    |   8,392 |  $12.10 |    81%
Search        |  31,094 |  $23.40 |    68%
Translation   |   3,201 |   $8.90 |    45%
--------------+---------+---------+----------
Total         | 100,765 | $201.00 |    63%

 Previous month without optimizations: $890
 Current month with all strategies: $201
 Total savings: $689 (77%)

Putting It All Together: The Optimization Stack

These strategies compound. Here's how they layer together:

Incoming LLM Request
        |
        v
  [ Rate Limiter ]-------- Over limit? → Error
        |
        v
  [ Budget Check ]-------- Over budget? → Error
        |
        v
  [ Semantic Cache ]------ Hit? → Return cached
        | Miss
        v
  [ Prompt Compression ]-- Trim tokens
        |
        v
  [ Model Router ]-------- Pick cheapest capable model
        |
        v
  [ API Call + Metrics ]-- Record everything
        |
        v
  [ Cache Response ]------ Store for future hits

Realistic Combined Savings

Let's model a real application: a developer documentation assistant processing 200,000 queries/month, currently using Claude Sonnet 4.6 for everything:

Baseline (no optimization):
  200,000 queries × (600 avg input + 1,200 avg output tokens)
  Input: 120M tokens × $3.00/1M = $360
  Output: 240M tokens × $15.00/1M = $3,600
  Total: $3,960/month

After optimization stack:
  1. Semantic caching (70% hit rate): 140,000 queries eliminated
     Remaining: 60,000 queries
     Embedding costs: $0.24
     Cache infra: $15/month

  2. Model routing on remaining 60,000:
     - 15% complex → Claude Sonnet 4.6 (9,000 queries)
     - 25% moderate → Claude Haiku 4.5 (15,000 queries)
     - 40% simple → GPT-4.1 mini (24,000 queries)
     - 20% trivial → GPT-4.1 nano (12,000 queries)

  3. Prompt compression (30% token reduction across the board)

  Cost calculation:
     Sonnet: 9K × (420 in × $3.00 + 840 out × $15.00)/1M = $124.74
     Haiku: 15K × (420 × $1.00 + 840 × $5.00)/1M = $69.30
     Mini: 24K × (420 × $0.20 + 840 × $0.80)/1M = $18.14
     Nano: 12K × (420 × $0.05 + 840 × $0.20)/1M = $2.27
     Classification: 60K × $0.000006 = $0.36
     Cache infra: $15
     Embeddings: $0.24

  Total: ~$230/month

  Savings: $3,730/month (94% reduction)

Going from $3,960 to $230 per month. That's $44,760 annually. For many startups, this is the difference between profitability and running out of runway.

Common Mistakes to Avoid

1. Caching Without Invalidation

Cached responses go stale. If your product info changes, your support bot shouldn't serve yesterday's answers. Implement proper TTLs and manual cache busting for content updates.
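One lightweight busting scheme, beyond TTLs, is to fold a content version into every cache key: bump the version on a content deploy, and all stale entries become unreachable at once while their TTLs garbage-collect them. A sketch (`contentVersion` would come from your config or deploy pipeline; the `llm:` prefix matches the caching code earlier):

```typescript
// Version-namespaced cache key. Bumping the version when product
// content changes orphans every old key; TTLs then clean them up
// without an explicit purge.
function cacheKey(contentVersion: string, cacheId: string): string {
  return `llm:v${contentVersion}:${cacheId}`;
}
```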

2. Over-Routing to Cheap Models

If 5% of your users get degraded responses because a query was misclassified, you've saved money but lost trust. Monitor quality scores and set up the fallback pattern described above.

3. Compressing Too Aggressively

Removing context from prompts can degrade response quality in subtle ways. Always A/B test prompt changes with quality metrics before fully rolling out.

4. Ignoring Provider Differences

Each provider has strengths. Claude excels at long-form analysis. GPT-4.1 nano is unbeatable for classification. Gemini Flash has the best price-to-performance for general tasks. Don't lock yourself to one provider.

5. Forgetting About Embeddings Costs

Semantic caching requires embedding every query. At scale, embedding costs can add up. Use text-embedding-3-small (not large) and batch embedding requests where possible.
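On the batching point: the embeddings endpoint accepts an array of inputs, so queries collected over a short window can be embedded in one request instead of one call each. A sketch; the chunk helper is the reusable part, and the commented usage shows where it would plug into the OpenAI SDK:

```typescript
// Split a list of texts into fixed-size batches,
// one embeddings request per batch.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage with the OpenAI SDK (one request per batch of 100):
// for (const batch of chunk(pendingQueries, 100)) {
//   await openai.embeddings.create({
//     model: 'text-embedding-3-small',
//     input: batch, // array input: many texts, one request
//   });
// }
```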

Conclusion

LLM API costs are not fixed physics; they're engineering problems with engineering solutions. The five strategies in this guide (semantic caching, intelligent routing, prompt compression, batch processing, and observability) can reduce your spending by 70-90% without degrading user experience.

Here's the priority order for implementation:

  1. Semantic caching: highest ROI, implement first
  2. Prompt compression: quick win, audit your system prompts today
  3. Model routing: significant savings, moderate complexity
  4. Observability: you need this to find your next optimization
  5. Batch processing: easy savings for background workloads

Start with one strategy, measure the impact, and stack the next one on top. In my experience, teams that implement all five go from "we might need to raise our Series B just to pay OpenAI" to "our LLM costs are a rounding error."

Your API bill doesn't have to be scary. Make it boring.

Tags: AI, LLM, OpenAI, Anthropic, cost-optimization, semantic-caching, prompt-engineering, TypeScript
