Token Efficiency

ORCA Framework should treat token spend as an operational constraint, not an afterthought.

The goal is not fake precision or obsessive minimization. The goal is to avoid waste, keep long-running work bounded, and preserve quality by spending tokens where they matter.

Core Policy

ORCA Framework should prefer:

stable prompt prefixes that can benefit from provider-side prompt caching
short durable artifacts over replaying long histories
summaries, receipts, and shared-state snapshots over raw transcript reuse
bounded output requests instead of open-ended expansion
stage-specific budgets instead of one giant undifferentiated allowance
tools, schemas, and artifacts that reduce repeated explanation

ORCA Framework should avoid:

resending large background context that has not changed
duplicating the same instructions across multiple artifacts
pasting full traces, logs, or long docs when a citation or summary is enough
retry loops that consume tokens without improving the result
using high-effort or long-output behavior for small, scoped tasks

Practical Levers

1. Keep Stable Prefixes Stable

Provider-side prompt caching matches on the start of the prompt, so it only helps when the repeated prefix stays exactly identical from one request to the next.

That means:

keep durable instructions at the front
keep variable run-specific material near the end
avoid needless timestamp, formatting, or ordering drift in repeated prompt blocks
keep tool schemas stable when possible

If a host or provider exposes cache controls or cache metrics, ORCA Framework should use them conservatively and measure whether they actually help.
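As a minimal sketch of the ordering rule above, the following keeps durable instructions and tool schemas in a deterministically serialized prefix and appends run-specific material last. The instruction text, tool names, and function names are hypothetical and only illustrate the idea; they are not part of ORCA Framework.

```python
import hashlib
import json

# Hypothetical durable content; real instructions and schemas would come from project artifacts.
STABLE_INSTRUCTIONS = "You are a workflow agent. Follow the milestone contract."
TOOL_SCHEMAS = [
    {"name": "read_file", "parameters": {"path": "string"}},
    {"name": "run_tests", "parameters": {"target": "string"}},
]


def build_prefix(instructions: str, tool_schemas: list[dict]) -> str:
    """Serialize durable content deterministically so repeated calls are identical."""
    # Sorting tools by name, sorting keys, and fixing separators avoids
    # ordering and whitespace drift between runs.
    schemas = json.dumps(
        sorted(tool_schemas, key=lambda s: s["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return f"{instructions}\n\nTOOLS:{schemas}\n"


def build_prompt(prefix: str, run_context: str) -> str:
    """Durable prefix first; run-specific material (including timestamps) last."""
    return f"{prefix}\nRUN CONTEXT:\n{run_context}\n"


def prefix_fingerprint(prefix: str) -> str:
    """Short hash to log per run; a changed value signals likely cache misses."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
```

Logging the fingerprint alongside whatever cache metrics the host exposes gives a cheap way to check whether a prefix change lines up with a drop in cache hits.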

2. Compress Before Expanding

Before starting another large reasoning pass, prefer:

latest receipt
current shared-state summary
run-memory summary
milestone or goal contract
narrow artifact excerpts

Do not replay a full issue thread, trace, or transcript unless the missing detail is actually blocking the work.
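One way to apply this is to assemble context from compact artifacts in priority order and stop adding once a budget is reached. The sketch below assumes a rough four-characters-per-token heuristic and hypothetical function names; a real harness would use the host's token counts where available.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def assemble_context(artifacts: list[tuple[str, str]], budget_tokens: int) -> list[str]:
    """Add compact artifacts in priority order while they fit the budget.

    `artifacts` is a list of (label, text) pairs, highest priority first.
    """
    selected: list[str] = []
    used = 0
    for label, text in artifacts:
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip what does not fit; a smaller item later may still fit
        selected.append(f"## {label}\n{text}")
        used += cost
    return selected
```

Ordering the input as receipt, shared-state summary, run-memory summary, contract, then excerpts means the largest and least durable material is the first to be dropped.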

3. Spend By Stage

Different stages deserve different token behavior:

onboarding and discovery can spend more on context gathering, but should end with a compact orientation artifact
spec and milestone planning should produce concise contracts that reduce downstream re-explanation
implementation should avoid verbose narration and prioritize bounded execution
QA should spend on evidence, not repeated theorizing
long-running goal or background work should stop at milestone boundaries and emit a receipt before drifting further

See stage-budgets.md for the enforcement layer.
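A per-stage budget check could look roughly like the sketch below. The stage names follow the list above, but the numbers and the `StageBudget` and `check_budget` names are placeholders for illustration and are not taken from stage-budgets.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageBudget:
    max_input_tokens: int
    max_output_tokens: int


# Placeholder numbers for illustration only; not taken from stage-budgets.md.
STAGE_BUDGETS = {
    "discovery": StageBudget(60_000, 2_000),
    "planning": StageBudget(30_000, 3_000),
    "implementation": StageBudget(40_000, 4_000),
    "qa": StageBudget(30_000, 2_000),
}


def check_budget(stage: str, input_tokens: int, output_tokens: int) -> list[str]:
    """Return a list of violations; an empty list means the call is within budget."""
    budget = STAGE_BUDGETS[stage]
    problems = []
    if input_tokens > budget.max_input_tokens:
        problems.append(f"{stage}: input {input_tokens} > {budget.max_input_tokens}")
    if output_tokens > budget.max_output_tokens:
        problems.append(f"{stage}: output {output_tokens} > {budget.max_output_tokens}")
    return problems
```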

4. Treat Retries As Token Spend

Retries are hidden token burn.

If a run keeps:

rediscovering the same repo facts
re-reading the same large artifacts
repeating the same failing tool pattern
expanding outputs without changing the plan

then the workflow is inefficient even if the run eventually succeeds.
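To make that burn visible, a run can count identical failures and the tokens they consumed, then stop and replan once a pattern repeats. The following is a minimal sketch; `RetryTracker` and its threshold are hypothetical, not an existing ORCA Framework component.

```python
from collections import Counter


class RetryTracker:
    """Counts repeated (tool, args, error) failures so a run can stop and replan."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.failures: Counter = Counter()
        self.tokens_spent_on_failures = 0

    def record_failure(self, tool: str, args: str, error: str, tokens: int) -> bool:
        """Record a failed attempt; return True once the same pattern exceeds the limit."""
        key = (tool, args, error)
        self.failures[key] += 1
        self.tokens_spent_on_failures += tokens
        return self.failures[key] > self.max_repeats
```

When `record_failure` returns True, the sensible next step is a changed plan or a receipt describing the blocker, not another identical attempt.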

5. Prefer Structured Artifacts

Well-shaped artifacts reduce future prompt size.

Prefer:

specs over long freeform intent restatements
receipts over replaying full traces
shared state over multi-role transcript stitching
schemas and templates over re-explaining required fields each time
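As an illustration of a well-shaped artifact, a receipt can be a small typed record with a compact, deterministic rendering. The field names below are assumptions chosen for the example; they do not describe an actual ORCA Framework receipt schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Receipt:
    """Hypothetical receipt shape; field names are illustrative, not an ORCA schema."""

    stage: str
    milestone: str
    outcome: str
    evidence: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def to_prompt_block(self) -> str:
        """Compact, deterministic rendering for reuse in a later prompt."""
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
```

A later run can then include `receipt.to_prompt_block()` in place of the trace that produced it, with the evidence entries acting as citations back to the full material.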

6. Keep Large Inputs Honest

Large code, logs, screenshots, and documents are sometimes justified. When they are:

send only the relevant slice when possible
downsample or crop images if full fidelity is unnecessary
quote or excerpt only the exact section needed for the next step
capture the durable conclusion in run memory, shared state, or a receipt so the next run does not need the full blob again
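For text inputs such as logs, sending only the relevant slice can be as simple as keeping a few lines around each match. This is a sketch under the assumption that a plain substring match is enough to locate the relevant section; the function name and defaults are hypothetical.

```python
def relevant_slice(
    lines: list[str], needle: str, context: int = 2, max_lines: int = 40
) -> list[str]:
    """Keep only lines near matches of `needle`, capped at `max_lines`."""
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    # Prefix each kept line with its 1-based line number so the excerpt stays citable.
    return [f"{i + 1}: {lines[i]}" for i in sorted(keep)][:max_lines]
```

Keeping the original line numbers lets a receipt cite the exact location, so a later run can fetch the full log only if that detail becomes blocking.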

7. Measure What The Harness Actually Exposes

Use exact token or cost data when the host provides it.

If the host does not provide it:

estimate only when the estimate is operationally useful
mark the confidence clearly
fall back to time, retry burden, output size, and stage churn as the practical efficiency signals
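One way to keep exact and estimated figures from being confused is to carry the source and confidence with every reading. The sketch below assumes a rough four-characters-per-token heuristic for the fallback; `UsageReading` and `read_usage` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageReading:
    tokens: int
    source: str      # "host" when reported by the host, "estimate" otherwise
    confidence: str  # "exact" or "low"


def read_usage(reported_tokens: Optional[int], text: str) -> UsageReading:
    """Prefer host-reported counts; otherwise return a clearly labeled estimate."""
    if reported_tokens is not None:
        return UsageReading(reported_tokens, "host", "exact")
    # Assumption: about 4 characters per token. A rough heuristic, not a tokenizer.
    return UsageReading(max(1, len(text) // 4), "estimate", "low")
```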

Review Questions

Before approving a workflow change, ask:

Does this reduce repeated prompt mass across runs?
Does it improve cacheability by keeping repeated instructions stable?
Does it replace transcript replay with a smaller artifact?
Does it cut retries or just move them?
Does it save tokens without hiding important evidence?

Research Direction

The recurring ecosystem sweep should keep watching:

provider prompt-caching behavior and limits
token-aware tool-use patterns
task-budget or effort controls
better ways to summarize or checkpoint long-running work
host-specific token telemetry that ORCA Framework can surface without overclaiming precision

Sources

OpenAI prompt caching: https://developers.openai.com/api/docs/guides/prompt-caching
OpenAI token usage guidance: https://developers.openai.com/api/docs/guides/advanced-usage
Anthropic prompt caching: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
Anthropic token-efficient tool-use and task-budget guidance: https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/token-efficient-tool-use