Claude streaming with prompt caching

Anthropic SDK chat with streaming + cache_control on the system prompt — the fastest, cheapest path to a chat product.

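Prompt caching is one of the highest-leverage settings in a production Claude app. On repeat calls within the 5-minute TTL, a cached system prompt is billed at roughly a tenth of the base input-token price (the first call pays a premium of about 25% to write the cache), and time to first token (TTFT) drops by roughly 40%, with the gain depending on how long the cached prefix is. Each cache hit refreshes the TTL, so an active conversation keeps its cache warm. Below is the minimal pattern for a streaming chat endpoint with caching.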
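To make the cost claim concrete, here is a small arithmetic sketch. It assumes the 5-minute cache multipliers at the time of writing (cache writes at about 1.25× and cache reads at about 0.1× the base input price) and uses illustrative token counts; the function and constant names are hypothetical, and current pricing should be checked before relying on the numbers.

```typescript
// Token counters as they appear in the `usage` object of a response.
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

// Assumed multipliers relative to the base input-token price (5-minute cache).
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Input cost expressed in "base-price input token equivalents".
function effectiveInputTokens(u: CacheUsage): number {
  return (
    u.input_tokens +
    u.cache_creation_input_tokens * CACHE_WRITE_MULTIPLIER +
    u.cache_read_input_tokens * CACHE_READ_MULTIPLIER
  );
}

// First call writes the cache; the repeat call reads it.
const firstCall = effectiveInputTokens({
  input_tokens: 12,
  cache_creation_input_tokens: 4321,
  cache_read_input_tokens: 0,
});
const repeatCall = effectiveInputTokens({
  input_tokens: 12,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 4321,
});
// The same prompt with no caching at all.
const uncached = 12 + 4321;

console.log({ firstCall, repeatCall, uncached });
```

Under these assumptions the first call costs somewhat more than an uncached call, and every repeat call costs a little over a tenth of one, so caching pays for itself from the second request onward.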

Install

npm i @anthropic-ai/sdk

Streaming chat endpoint

import Anthropic from "@anthropic-ai/sdk";

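// The client reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// LARGE_SYSTEM_CONTEXT (used below) is assumed to be a long, stable string defined
// elsewhere in your app, e.g. a style guide or schema docs loaded at startup.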

export async function POST(req: Request) {
  const { messages } = await req.json();

  const stream = await client.messages.stream({
    model: "claude-opus-4-7",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: `You are a helpful coding assistant for Carson's portfolio site.
You answer concisely and prefer code over prose.
${LARGE_SYSTEM_CONTEXT}`,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages,
  });

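  // Forward text deltas to the client as server-sent events (SSE).
  const encoder = new TextEncoder();
  const body = new ReadableStream({
    async start(controller) {
      try {
        for await (const event of stream) {
          // Only text deltas are forwarded; other event types are skipped.
          if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
            controller.enqueue(
              encoder.encode(`data: ${JSON.stringify({ delta: event.delta.text })}\n\n`)
            );
          }
        }
        // finalMessage() resolves with the complete message, including the cache usage counters.
        const final = await stream.finalMessage();
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ done: true, usage: final.usage })}\n\n`)
        );
        controller.close();
      } catch (err) {
        // Surface upstream failures to the reader explicitly.
        controller.error(err);
      }
    },
    cancel() {
      // The client disconnected: stop the upstream request so no further tokens are generated.
      stream.abort();
    },
  });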

  return new Response(body, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}
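For completeness, here is a minimal sketch of a client that consumes this endpoint. It assumes the event shapes emitted above ({ delta } and { done, usage }); parseSseBuffer and readChat are hypothetical helper names, and the URL passed to readChat is whatever path your route is mounted at.

```typescript
// Payload shapes emitted by the endpoint above.
type ChatEvent = { delta: string } | { done: true; usage: unknown };

// Splits a buffer into complete SSE events and returns the unparsed remainder,
// since a network chunk can end in the middle of an event.
function parseSseBuffer(buffer: string): { events: ChatEvent[]; rest: string } {
  const parts = buffer.split("\n\n");
  const rest = parts.pop() ?? ""; // trailing piece is incomplete (or empty)
  const events: ChatEvent[] = [];
  for (const part of parts) {
    for (const line of part.split("\n")) {
      if (line.startsWith("data: ")) {
        events.push(JSON.parse(line.slice("data: ".length)) as ChatEvent);
      }
    }
  }
  return { events, rest };
}

// Hypothetical consumer: POSTs the messages, calls onDelta for each text chunk,
// and resolves with the usage object from the final event (if one arrives).
async function readChat(
  url: string,
  messages: unknown[],
  onDelta: (text: string) => void
): Promise<unknown> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  if (!res.ok || !res.body) throw new Error(`chat request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const { events, rest } = parseSseBuffer(buffer);
    buffer = rest;
    for (const ev of events) {
      if ("delta" in ev) onDelta(ev.delta);
      else return ev.usage;
    }
  }
  return undefined;
}
```

Buffering until a blank line is seen matters because fetch delivers arbitrary byte chunks, not whole events; parsing each chunk on its own would intermittently fail on split JSON.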

What's worth caching

  • Long, stable system prompts (style guide, tool definitions, schema docs).
  • RAG context that gets reused across turns within the same conversation.
  • Few-shot examples, which often account for most of the tokens in a large system prompt.
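Because the cache matches on an exact prefix, the order of blocks matters: stable material goes first, the cache_control marker goes on the last stable block, and anything that changes per request goes after it. The sketch below illustrates that layout; buildSystemBlocks is a hypothetical helper, the local SystemTextBlock type only mirrors the block shape used in the endpoint above, and the strings are placeholders.

```typescript
// Minimal local type mirroring the text blocks passed as `system` in the endpoint.
interface SystemTextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Hypothetical helper: stable parts first, cache breakpoint on the last stable block,
// per-request text (if any) after the breakpoint so it never changes the cached prefix.
function buildSystemBlocks(stableParts: string[], volatilePart?: string): SystemTextBlock[] {
  const stable = stableParts.filter((part) => part.length > 0);
  const blocks = stable.map((text, i): SystemTextBlock =>
    i === stable.length - 1
      ? { type: "text", text, cache_control: { type: "ephemeral" } }
      : { type: "text", text }
  );
  if (volatilePart) {
    blocks.push({ type: "text", text: volatilePart });
  }
  return blocks;
}

// Placeholder content; in practice these would be your real style guide and examples.
const exampleSystem = buildSystemBlocks(
  ["You answer concisely and prefer code over prose.", "Example Q/A pairs go here."],
  "Current page: /projects"
);
console.log(JSON.stringify(exampleSystem, null, 2));
```

The resulting array can be passed as the system parameter in place of the single block used in the endpoint. Keeping the volatile text after the breakpoint is the design choice that matters: a single changed character before the marker would turn every call into a cache write.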

What you'll see in usage

{
  "input_tokens": 12,
  "cache_creation_input_tokens": 4321,
  "cache_read_input_tokens": 0,
  "output_tokens": 287
}

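On the next call within the 5-minute window, the counters swap: cache_read_input_tokens jumps to roughly the size of the cached prefix (4321 in this example) and cache_creation_input_tokens drops to 0. Always log these fields; they are the most reliable signal that caching is wired up. If both stay at 0, the prefix was not cached, usually because it changed between calls or is shorter than the model's minimum cacheable length.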
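One way to make these counters easy to scan in logs is to reduce them to a status and a hit ratio. The following is a sketch under the assumption that the usage object has the fields shown above; summarizeCache and the status labels are hypothetical names, not part of the SDK.

```typescript
// Usage fields as shown in the sample above; the cache counters may be absent or null.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

type CacheStatus = "hit" | "write" | "partial" | "none";

// Classifies a response and reports what share of the input was served from cache.
function summarizeCache(u: Usage): { status: CacheStatus; hitRatio: number } {
  const created = u.cache_creation_input_tokens ?? 0;
  const read = u.cache_read_input_tokens ?? 0;
  // input_tokens counts only the uncached part, so the total is the sum of all three.
  const totalInput = u.input_tokens + created + read;

  let status: CacheStatus;
  if (read > 0 && created > 0) status = "partial"; // part of the prefix was reused, part newly written
  else if (read > 0) status = "hit";
  else if (created > 0) status = "write";
  else status = "none";

  return { status, hitRatio: totalInput === 0 ? 0 : read / totalInput };
}

// The sample response above is a first call: a pure cache write.
console.log(
  summarizeCache({
    input_tokens: 12,
    cache_creation_input_tokens: 4321,
    cache_read_input_tokens: 0,
    output_tokens: 287,
  })
);
```

A steady stream of "write" or "none" statuses on a chat endpoint is a sign that something in the prefix is changing between calls, or that requests are arriving more than five minutes apart.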