Caching

Flapjack opts every agent into prompt caching by default — there is nothing to enable in config. This doc explains how it works under the hood, what you can tune, and how to read the metrics.

What gets cached

For every turn, Flapjack marks up to four cache breakpoints on the LLM request:

  1. Tool definitions — the entire tools[] block is cached as one prefix. Adding/removing/editing any tool invalidates the whole tools cache and everything downstream.
  2. System prompt — stable_preamble plus any RAG / memory / plan context injected for this turn. Changing the preamble or the injected context invalidates the system cache and everything downstream.
  3. Compaction summary — when a thread has been compacted, the prior summary is injected as the first user content block and cached so the compressed history survives across turns.
  4. Last user message — the rolling tail of the conversation. This makes multi-turn chats progressively cheaper.

The order matters: a change at any level invalidates that level and every level after it. Anthropic's tools → system → messages ordering is strict; OpenAI matches whatever prefix it has seen before, in 128-token chunks.
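
To make the layering concrete, here is a rough sketch of how the four breakpoints could be marked on the Anthropic side. Flapjack builds this request for you; the model, variable names, and tool definition below are illustrative placeholders, and OpenAI needs no explicit markers, as described in the next section.

// Illustrative sketch of the four breakpoints on an Anthropic Messages API request.
// stablePreamble, injectedContext, compactionSummary, and latestUserMessage are placeholders.
const request = {
  model: 'claude-sonnet-4-6',
  tools: [
    // ...all tool definitions, with the marker on the last one...
    {
      name: 'search_docs',
      input_schema: { type: 'object' },
      cache_control: { type: 'ephemeral' },    // breakpoint 1: caches the whole tools[] prefix
    },
  ],
  system: [
    {
      type: 'text',
      text: stablePreamble + injectedContext,
      cache_control: { type: 'ephemeral' },    // breakpoint 2: system prompt + injected context
    },
  ],
  messages: [
    {
      role: 'user',
      content: [{
        type: 'text',
        text: compactionSummary,
        cache_control: { type: 'ephemeral' },  // breakpoint 3: compaction summary (when present)
      }],
    },
    // ...prior turns...
    {
      role: 'user',
      content: [{
        type: 'text',
        text: latestUserMessage,
        cache_control: { type: 'ephemeral' },  // breakpoint 4: rolling last-user-message tail
      }],
    },
  ],
};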

Provider-specific behaviour

Anthropic (Claude)

  • Default ttl: 5 minutes, ephemeral. Cache writes cost 1.25x base input price; cache reads cost 0.1x (see the cost sketch after this list).

  • Optional ttl: 1 hour. Cache writes cost 2x base. Use this when requests on the same prefix arrive less often than every 5 minutes but more often than every hour (e.g. cron-driven runners). Opt in via anthropicOverrides.cache.ttl: '1h'.

  • Per-model minimum prompt size — caching only triggers above a threshold:

    Model                              Min cacheable tokens
    claude-opus-4-7 / 4-6 / 4-5        4096
    claude-sonnet-4-6                  2048
    claude-haiku-4-5                   4096
    older claude-* (4 / 3.7 / 3.5)     1024–2048

    Below the threshold the request is silently processed without caching. Look at usage.cache_read_tokens / usage.cache_write_tokens on the done event — if both are 0, the prompt was too small.

  • 4-breakpoint limit per request. We use exactly 4 (tools, system, compaction, last user) when all are present; 3 otherwise.
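
To put numbers on the multipliers above, here is a rough input-cost sketch using the usage fields described under "Reading the metrics" below. The base per-token price is a placeholder; substitute your model's list price.

// Rough Anthropic input-cost estimate from the multipliers above.
// BASE_INPUT_USD_PER_TOKEN is a placeholder -- use your model's list price.
const BASE_INPUT_USD_PER_TOKEN = 3 / 1_000_000;   // e.g. $3 per million input tokens (illustrative)

function estimateInputCostUsd(usage: {
  total_input_tokens: number;      // un-cached input, billed at 1x
  cache_read_tokens: number;       // billed at 0.1x
  cache_write_5m_tokens: number;   // billed at 1.25x
  cache_write_1h_tokens: number;   // billed at 2x
}): number {
  return BASE_INPUT_USD_PER_TOKEN * (
    1.0  * usage.total_input_tokens +
    0.1  * usage.cache_read_tokens +
    1.25 * usage.cache_write_5m_tokens +
    2.0  * usage.cache_write_1h_tokens
  );
}

// A 10k-token prefix written once and read once within 5 minutes costs
// 1.25x + 0.1x = 1.35x of base, versus 2x of base for two uncached sends.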

OpenAI

  • Caching is fully automatic for prompts >= 1024 tokens, matched in 128-token chunks, on gpt-4o and newer.
  • Cache reads are priced at 0.1x base input (90% discount). Writes are free — there is no separate write tier.
  • We pass prompt_cache_key (defaults to thread_id) so requests on the same conversation hit the same backend shard.
  • We pass safety_identifier (defaults to org_id) — OpenAI uses this as a stable per-end-user identifier for safety reporting and tenant isolation.
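
For illustration, this is roughly the shape of the request body on the OpenAI side. Flapjack sends it for you; the threadId / orgId values shown simply mirror the defaults described above, and messages stands in for the tools + system + history prefix.

// Sketch of the OpenAI-side request (illustrative values).
const body = {
  model: 'gpt-4o',
  messages,                        // tools + system + history; keep the prefix stable across turns
  prompt_cache_key: threadId,      // routes same-conversation requests to the same cache shard
  safety_identifier: orgId,        // stable per-end-user identifier for safety reporting
};

const res = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(body),
});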

Per-message overrides

await client.sendMessage(threadId, content, {
  anthropicOverrides: {
    cache: { ttl: '1h', cacheTools: true },  // 1-hour TTL; cacheTools defaults to true
  },
  openaiOverrides: {
    promptCacheKey: `agent-template:${templateId}`,  // share cache across threads
    safetyIdentifier: hashedUserId,
  },
});

Set anthropicOverrides.cache.disabled: true (or openaiOverrides.disabled: true) to skip cache_control entirely — useful for debugging cache invalidation.
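
For example, to send a single uncached turn while bisecting an invalidation problem:

await client.sendMessage(threadId, content, {
  anthropicOverrides: { cache: { disabled: true } },  // no cache_control blocks on this request
  openaiOverrides: { disabled: true },                // skip caching behaviour on the OpenAI side too
});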

Reading the metrics

Both providers report cache token counts on every response. They land on the SSE done event as:

{
  type: 'done',
  usage: {
    total_input_tokens: 12345,        // un-cached input (Anthropic) or full input incl. cached (OpenAI)
    cache_read_tokens: 11000,         // tokens served from cache (priced at 0.1x base)
    cache_write_tokens: 1200,         // tokens written to cache this turn (Anthropic only)
    cache_write_5m_tokens: 1200,      // ttl breakdown (Anthropic only)
    cache_write_1h_tokens: 0,         // ttl breakdown (Anthropic only)
    estimated_cost_usd: 0.0023,       // already credited for the cache discount
  }
}

The dashboard analytics page aggregates these into per-agent cache-hit ratios. A healthy agent with a stable preamble should show cache_read_tokens / total_input_tokens >= 0.8 after the first turn of any conversation.
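
A minimal helper for computing that per-turn ratio from the done event (field names as in the payload above):

// Per-turn cache-hit ratio; >= 0.8 is the rule of thumb for a healthy,
// stable-preamble agent after the first turn.
function cacheHitRatio(usage: { total_input_tokens: number; cache_read_tokens: number }): number {
  return usage.cache_read_tokens / Math.max(usage.total_input_tokens, 1);
}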

Anthropic workspace isolation (Feb 2026 onward)

Anthropic's caches are now isolated per workspace rather than per organization. If you split traffic across multiple Anthropic workspaces, each one has its own cache pool and you'll see one cold turn per workspace per prefix. Bedrock and Vertex AI continue to use organization-level isolation.

Docs last updated May 11, 2026