Evals¶

ORCA Framework evals score trajectories, not just final outputs. A result can look good and still fail because the agent ignored scope, skipped the spec, used the wrong artifacts, hid uncertainty, or looped without producing evidence.

Purpose¶

Use evals to answer:

Did the agent ask the right questions?
Did it produce a usable spec or plan?
Did it stay inside approved scope?
Did it use the right artifacts and context boundaries?
Did it verify the work appropriately?
Did it hand off clearly?
Did it avoid avoidable loops or unnecessary work?
Did it produce a final result that would help a maintainer ship safely?

Evaluation Model¶

Each eval case should include:

scenario and setup
target command or workflow
expected artifacts
pass/fail checks
rubric dimensions
reviewer notes

Use templates/eval-case.md for individual cases and templates/eval-report.md for results. Required fields are defined in templates/contracts/eval-contract.md.

For onboarding and spec quality comparisons, use the dedicated benchmark pack in docs/benchmark-pack.md and benchmarks/onboarding-spec/.

Trajectory Dimensions¶

Recommended dimensions:

Intake quality
Spec quality
Scope control
Artifact discipline
Verification discipline
Handoff quality
Efficiency
Final usefulness

Scoring¶

ORCA Framework supports both:

Pass/fail checks for hard requirements
Rubric scores for judgment-heavy dimensions

Use pass/fail for things like:

a required spec was produced
approval was requested before risky work
QA mode boundaries were preserved
the run recorded a stop reason

Use rubric scoring when grading:

question quality
clarity of reasoning
usefulness of the plan
strength of verification choices

Eval Sets¶

Start with small curated eval sets that a maintainer can actually review. The recommended starter set lives at examples/evals/starter-set.md and covers 12 issue-shaped scenarios.

Reporting¶

Each eval run should produce a human-reviewable report with:

scope of the eval run
case-by-case outcomes
common failure patterns
suggested framework improvements

When workflow efficiency matters, pair evals with per-run accounting from docs/workflow-accounting.md. When shared-state, checkpoint, or inspection artifacts exist, use them as additional trajectory evidence rather than ignoring the coordination layer. When portable schemas exist, use them to validate structural consistency before or alongside the judgment-heavy evaluation pass.

Portable eval-adjacent inputs are especially useful when comparing:

artifact shape drift across versions
schema-backed mappings into external eval tools
whether a candidate workflow is emitting structurally comparable artifacts

Relationship To QA¶

QA checks whether the product behaves correctly. Evals check whether the agent workflow behaved correctly.

Both matter. QA can pass while the workflow still fails the framework standard.