# Caching
Flapjack opts every agent into prompt caching by default — there is no config flag to enable it. This doc explains how it works under the hood, what you can tune, and how to read the metrics.
## What gets cached
For every turn, Flapjack marks up to four cache breakpoints on the LLM request:
- Tool definitions — the entire `tools[]` block is cached as one prefix. Adding, removing, or editing any tool invalidates the whole tools cache and everything downstream.
- System prompt — `stable_preamble` plus any RAG / memory / plan context injected for this turn. Changing the preamble or the injected context invalidates the system cache and everything downstream.
- Compaction summary — when a thread has been compacted, the prior summary is injected as the first user content block and cached so the compressed history survives across turns.
- Last user message — the rolling tail of the conversation. This makes multi-turn chats progressively cheaper.
The order matters: a change at any level invalidates that level and every
level after it. Anthropic's tools → system → messages ordering is
strict; OpenAI matches whatever prefix it has seen before, in 128-token
chunks.
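The breakpoint placement above can be sketched against a raw Anthropic-style request body. This is an illustrative sketch, not Flapjack internals: `markBreakpoints` and the simplified `Block` type are hypothetical, though the `cache_control: { type: 'ephemeral' }` shape follows Anthropic's documented convention.

```typescript
type CacheControl = { type: 'ephemeral'; ttl?: '5m' | '1h' };
type Block = { [k: string]: unknown; cache_control?: CacheControl };

function markBreakpoints(req: {
  tools: Block[];
  system: Block[];
  messages: { role: string; content: Block[] }[];
}) {
  const bp: CacheControl = { type: 'ephemeral' };
  // 1. Tools: a breakpoint on the LAST tool caches the whole tools[] prefix.
  if (req.tools.length) req.tools[req.tools.length - 1].cache_control = bp;
  // 2. System: breakpoint after the preamble plus injected context.
  if (req.system.length) req.system[req.system.length - 1].cache_control = bp;
  // 3. Compaction summary: the first user content block, when present.
  const first = req.messages[0];
  if (first?.role === 'user') first.content[0].cache_control = bp;
  // 4. Last user message: the rolling tail of the conversation.
  const last = req.messages[req.messages.length - 1];
  if (last?.role === 'user')
    last.content[last.content.length - 1].cache_control = bp;
  return req;
}
```

Because each breakpoint caches everything before it, editing a tool (step 1) discards all four prefixes, while a new user message only discards the tail.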
## Provider-specific behaviour
### Anthropic (Claude)
- Default TTL: 5 minutes, ephemeral. Cache writes cost 1.25x base input price; cache reads cost 0.1x.
- Optional TTL: 1 hour. Cache writes cost 2x base. Use this when requests on the same prefix arrive less often than every 5 minutes but more often than every hour (e.g. cron-driven runners). Opt in via `anthropicOverrides.cache.ttl: '1h'`.
- Per-model minimum prompt size — caching only triggers above a threshold:
  | Model | Min cacheable tokens |
  | --- | --- |
  | `claude-opus-4-7` / `4-6` / `4-5` | 4096 |
  | `claude-sonnet-4-6` | 2048 |
  | `claude-haiku-4-5` | 4096 |
  | older `claude-*` (4 / 3.7 / 3.5) | 1024–2048 |

  Below the threshold the request is silently processed without caching. Look at `usage.cache_read_tokens` / `usage.cache_write_tokens` on the `done` event — if both are 0, the prompt was too small.
- 4-breakpoint limit per request. We use exactly 4 (tools, system, compaction, last user) when all are present; 3 otherwise.
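The 5-minute vs 1-hour trade-off reduces to simple arithmetic. A back-of-envelope sketch using the multipliers above (`costPerToken` is a hypothetical helper, not SDK code; it models the worst case for the 5m TTL, where the cache always expires between requests, against one 1h write followed by reads):

```typescript
// Relative cost per base input token for n requests on the same prefix,
// arriving with gaps between 5 minutes and 1 hour.
function costPerToken(n: number, ttl: '5m' | '1h'): number {
  if (ttl === '5m') return n * 1.25; // cache expires: every request is a write
  return 2.0 + (n - 1) * 0.1;       // one 1h write, then reads at 0.1x
}
```

From the second request onward the 1-hour cache wins (2.1x vs 2.5x at n = 2), which is why cron-style runners benefit from opting in.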
### OpenAI
- Caching is fully automatic for prompts >= 1024 tokens, matched in 128-token chunks, on `gpt-4o` and newer.
- Cache reads are priced at 0.1x base input (90% discount). Writes are free — there is no separate write tier.
- We pass `prompt_cache_key` (defaults to `thread_id`) so requests on the same conversation hit the same backend shard.
- We pass `safety_identifier` (defaults to `org_id`) — OpenAI uses this as a stable per-end-user identifier for safety reporting and tenant isolation.
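The defaulting described in the last two bullets can be sketched as follows. `buildOpenAIParams` is a hypothetical helper, not part of the Flapjack SDK; the `prompt_cache_key` / `safety_identifier` field names and their `thread_id` / `org_id` defaults are as described above.

```typescript
function buildOpenAIParams(opts: {
  threadId: string;
  orgId: string;
  promptCacheKey?: string;   // override to share cache across threads
  safetyIdentifier?: string; // override with e.g. a hashed end-user id
}) {
  return {
    prompt_cache_key: opts.promptCacheKey ?? opts.threadId,
    safety_identifier: opts.safetyIdentifier ?? opts.orgId,
  };
}
```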
## Per-message overrides
```typescript
await client.sendMessage(threadId, content, {
  anthropicOverrides: {
    cache: { ttl: '1h', cacheTools: true }, // 1-hour TTL; cacheTools defaults to true
  },
  openaiOverrides: {
    promptCacheKey: `agent-template:${templateId}`, // share cache across threads
    safetyIdentifier: hashedUserId,
  },
});
```
Set `anthropicOverrides.cache.disabled: true` (or `openaiOverrides.disabled: true`)
to skip `cache_control` entirely — useful for debugging cache invalidation.
## Reading the metrics
Both providers report cache token counts on every response. They land on
the SSE `done` event as:
```jsonc
{
  type: 'done',
  usage: {
    total_input_tokens: 12345,   // un-cached input (Anthropic) or full input incl. cached (OpenAI)
    cache_read_tokens: 11000,    // tokens served from cache (priced at 0.1x base)
    cache_write_tokens: 1200,    // tokens written to cache this turn (Anthropic only)
    cache_write_5m_tokens: 1200, // ttl breakdown (Anthropic only)
    cache_write_1h_tokens: 0,    // ttl breakdown (Anthropic only)
    estimated_cost_usd: 0.0023,  // already credited for the cache discount
  }
}
```
The dashboard analytics page aggregates these into per-agent cache-hit
ratios. A healthy agent with a stable preamble should show
`cache_read_tokens / total_input_tokens >= 0.8` after the first turn of
any conversation.
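A sketch of that ratio as a helper. The provider normalization here is an assumption on my part: since Anthropic's `total_input_tokens` excludes cached tokens while OpenAI's includes them (per the field comment above), cached reads are added back to the Anthropic denominator so the two are comparable. `cacheHitRatio` is hypothetical, not a dashboard API.

```typescript
function cacheHitRatio(
  usage: { total_input_tokens: number; cache_read_tokens: number },
  provider: 'openai' | 'anthropic',
): number {
  const total =
    provider === 'anthropic'
      ? usage.total_input_tokens + usage.cache_read_tokens // add cached reads back in
      : usage.total_input_tokens;                          // already includes cached
  return total === 0 ? 0 : usage.cache_read_tokens / total;
}
```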
## Anthropic workspace isolation (Feb 2026 onward)
Anthropic's caches are now isolated per workspace rather than per organization. If you split traffic across multiple Anthropic workspaces, each one has its own cache pool and you'll see one cold turn per workspace per prefix. Bedrock and Vertex AI continue to use organization-level isolation.