
Compaction

Automatic conversation context management using provider-native compaction APIs. Keeps long-running conversations coherent without exceeding context limits.

Compaction automatically summarizes older conversation history when approaching a model's context window limit. This allows agents to handle arbitrarily long conversations without losing important context.

How Compaction Works

  1. Turn N: the conversation grows past the token threshold
  2. The provider generates a summary of older messages
  3. The summary replaces the old messages in context
  4. Turn N+1: the agent sees the summary + recent messages (reduced token count)
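In sketch form, the per-turn trigger check behaves like this (a minimal illustration; CompactionConfig and shouldCompact are invented names, not part of the Flapjack SDK):

```typescript
// Hypothetical sketch of the per-turn compaction decision.
// triggerTokens mirrors the anthropicTriggerTokens / openaiCompactThreshold settings.
interface CompactionConfig {
  enabled: boolean;
  triggerTokens: number; // provider-specific threshold
}

function shouldCompact(inputTokens: number, config: CompactionConfig): boolean {
  // Compaction only fires when enabled and the turn crossed the threshold.
  return config.enabled && inputTokens > config.triggerTokens;
}

// Turn N: 160k input tokens against a 150k Anthropic-style trigger -> compacts
const fire = shouldCompact(160_000, { enabled: true, triggerTokens: 150_000 });
```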

Compaction is provider-native — it uses each LLM provider's built-in mechanisms rather than a custom summarization layer:

| Provider | Mechanism | How It Works |
| --- | --- | --- |
| Anthropic (Claude) | Server-side compact_20260112 beta | The API automatically summarizes when input tokens exceed a configurable trigger threshold. Returns a compaction block that replaces older messages. |
| OpenAI (GPT) | Summary-based fallback | After a turn exceeds the threshold, a separate gpt-5.4-nano call generates a conversation summary. The summary is injected as a system message on subsequent turns. |

Configuration

Compaction is configured per-agent via the API or dashboard:

| Setting | Default | Description |
| --- | --- | --- |
| enabled | false | Turn compaction on or off |
| anthropicTriggerTokens | 150000 | Token threshold for Anthropic compaction (min: 50,000) |
| anthropicInstructions | null | Custom summarization prompt (e.g., "preserve all code blocks") |
| anthropicPauseAfter | false | Pause after compaction for custom content insertion |
| openaiCompactThreshold | 100000 | Token threshold for OpenAI summary generation (min: 1,000) |

API

# Get compaction config
GET /api/agents/{agentId}/compaction

# Update compaction config
PUT /api/agents/{agentId}/compaction
{
  "enabled": true,
  "anthropicTriggerTokens": 100000,
  "openaiCompactThreshold": 75000
}

SDK

// Read config
const config = await client.getCompactionConfig(agentId);

// Enable compaction
await client.updateCompactionConfig(agentId, {
  enabled: true,
  anthropicTriggerTokens: 100000,
});

State Persistence

Compaction state is stored per-thread in thread_compaction_state:

| Field | Description |
| --- | --- |
| provider | Which provider performed the compaction (anthropic or openai) |
| compaction_summary | The generated summary text |
| compacted_at | When compaction last occurred |
| compaction_count | How many times this thread has been compacted |
| pre_compaction_tokens | Input tokens before compaction |
| post_compaction_tokens | Input tokens after compaction |
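A row of this table maps onto a record like the following sketch (field names taken from the table; the tokenReduction helper is invented for illustration):

```typescript
// Sketch of a thread_compaction_state row, using the fields listed above.
interface ThreadCompactionState {
  provider: "anthropic" | "openai";
  compaction_summary: string;
  compacted_at: string; // ISO-8601 timestamp
  compaction_count: number;
  pre_compaction_tokens: number;
  post_compaction_tokens: number;
}

// Hypothetical helper: fraction of input tokens eliminated by the last compaction.
function tokenReduction(state: ThreadCompactionState): number {
  return 1 - state.post_compaction_tokens / state.pre_compaction_tokens;
}
```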

The summary is also written to threads.summary for backward compatibility with agents that don't use compaction.

Compaction vs Memory

| | Compaction | Memory |
| --- | --- | --- |
| Purpose | Reduce context window usage | Store discrete retrievable facts |
| Scope | Per-thread conversation history | Per-agent, per-thread, or per-resource |
| Trigger | Token threshold exceeded | Auto-extraction or explicit tool call |
| Retrieval | Automatically prepended to messages | Semantic search injection |
| Persistence | Replaces old messages with summary | Stored indefinitely in memories table |

These systems are complementary: compaction keeps the context window manageable, while memory provides long-term recall of specific facts across threads.

Cost Tracking

Compaction generates additional tokens that are tracked separately in usage analytics:

  • compaction_input_tokens — tokens sent to the compaction/summary model
  • compaction_output_tokens — tokens generated by the compaction/summary

These tokens are included in the estimated_cost_usd value returned in the done SSE event. The billing model differs by provider:

| Provider | Compaction Model | Rate |
| --- | --- | --- |
| Anthropic | Same as the main model | Main model's per-token rate |
| OpenAI | gpt-5.4-nano | $0.20 (input) / $1.25 (output) per 1M tokens |

In the cost_events ledger, main-LLM and compaction costs are recorded as separate rows so that per-token-rate analysis stays accurate. The usage_events table stores the combined total for backward compatibility.
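As a rough sketch, OpenAI-side compaction cost can be estimated from the rates above, assuming $0.20 applies to input tokens and $1.25 to output tokens per million (the function name is illustrative, not an SDK method):

```typescript
// Sketch: estimate the incremental cost of one OpenAI compaction call,
// assuming $0.20 per 1M input tokens and $1.25 per 1M output tokens.
function openaiCompactionCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens * 0.2 + outputTokens * 1.25) / 1_000_000;
}

// e.g. summarizing an 85k-token conversation into a ~1.5k-token summary
const cost = openaiCompactionCostUsd(85_000, 1_500); // ~0.018875 USD
```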

How History Limits Work

All providers now share a single history cap: the 500 most-recent messages are fetched (newest first, then reversed to chronological order). Older turns beyond this cap are handled by compaction when enabled.

| Compaction State | Behavior |
| --- | --- |
| Disabled | 500 most-recent messages sent to the model |
| Enabled | 500 most-recent messages sent; compaction summarizes older context when the token threshold is exceeded |

This replaced the previous per-provider limits (50 messages for OpenAI, 200 for Anthropic), which fetched the oldest N messages and so dropped the most recent turns once a thread exceeded the cap.
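The cap described above amounts to a "take the newest N, then restore chronological order" step, sketched here with invented names:

```typescript
// Sketch of the shared history cap: the query returns messages newest-first
// (as from ORDER BY created_at DESC LIMIT 500), and we reverse them so the
// model sees the conversation oldest-to-newest.
const HISTORY_CAP = 500;

interface Message { id: number; content: string }

function capHistory(allNewestFirst: Message[]): Message[] {
  // slice() copies, so reverse() does not mutate the caller's array.
  return allNewestFirst.slice(0, HISTORY_CAP).reverse();
}
```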

Example Flow (Anthropic)

  1. User enables compaction with anthropicTriggerTokens: 100000
  2. After 40 turns, input tokens reach 105,000
  3. Anthropic API detects threshold is exceeded
  4. API generates a summary of older messages (~3,500 tokens)
  5. Summary is returned as a compaction block
  6. Flapjack persists the summary in thread_compaction_state and threads.summary
  7. On turn 41, the compaction summary replaces the older messages
  8. Input tokens drop to ~25,000 (summary + recent messages)
  9. Conversation continues with full context awareness
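Steps 7 and 8 boil down to swapping older messages for the summary; a hypothetical sketch of that replacement (all names invented, not the actual Flapjack internals):

```typescript
// Hypothetical sketch of step 7: the compaction summary replaces older
// messages, while the most recent messages survive verbatim.
interface TokenizedMessage { role: string; content: string; tokens: number }

function applyCompaction(
  history: TokenizedMessage[], // chronological order
  summary: TokenizedMessage,   // the compaction block returned by the API
  keepRecent: number           // how many recent messages survive verbatim
): TokenizedMessage[] {
  return [summary, ...history.slice(-keepRecent)];
}
```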

Example Flow (OpenAI)

  1. User enables compaction with openaiCompactThreshold: 80000
  2. After 35 turns, input tokens reach 85,000
  3. After the response is streamed, Flapjack calls gpt-5.4-nano with the conversation
  4. A ~1,500 token summary is generated (with a 15-second timeout)
  5. Summary is persisted in thread_compaction_state and threads.summary
  6. On turn 36, the summary is injected as a system message
  7. Combined with the existing threads.summary mechanism, context is preserved
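Step 6's injection can be sketched as follows (buildTurnMessages and the message shape are assumptions for illustration, not the actual Flapjack implementation):

```typescript
// Sketch of step 6: the stored summary is injected as a system message
// ahead of the recent turns on the next request.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string }

function buildTurnMessages(summary: string | null, recent: ChatMessage[]): ChatMessage[] {
  if (!summary) return recent; // no compaction yet: send history as-is
  return [
    { role: "system", content: `Summary of earlier conversation:\n${summary}` },
    ...recent,
  ];
}
```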
Docs last updated May 11, 2026