Evals

ORCA Framework evals score trajectories, not just final outputs. A result can look good and still fail because the agent ignored scope, skipped the spec, used the wrong artifacts, hid uncertainty, or looped without producing evidence.

Purpose

Use evals to answer:

  • Did the agent ask the right questions?
  • Did it produce a usable spec or plan?
  • Did it stay inside approved scope?
  • Did it use the right artifacts and context boundaries?
  • Did it verify the work appropriately?
  • Did it hand off clearly?
  • Did it avoid needless loops and other unnecessary work?
  • Did it produce a final result that would help a maintainer ship safely?

Evaluation Model

Each eval case should include:

  • scenario and setup
  • target command or workflow
  • expected artifacts
  • pass/fail checks
  • rubric dimensions
  • reviewer notes

Use templates/eval-case.md for individual cases and templates/eval-report.md for results. Required fields are defined in templates/contracts/eval-contract.md.
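
As a rough illustration, the parts of an eval case listed above could be modeled as a small record. This is a minimal sketch, not the framework's format: the class name `EvalCase`, its field names, and the `missing_fields` helper are all hypothetical, and the authoritative required fields remain those in templates/contracts/eval-contract.md.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """Hypothetical in-memory shape for one eval case (field names are assumptions)."""

    scenario: str                       # scenario and setup
    target: str                         # command or workflow under test
    expected_artifacts: list[str] = field(default_factory=list)
    checks: list[str] = field(default_factory=list)             # pass/fail checks
    rubric_dimensions: list[str] = field(default_factory=list)
    reviewer_notes: str = ""

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


# Usage: a half-written case reports what still needs filling in.
draft = EvalCase(scenario="Fix a flaky test", target="plan")
print(draft.missing_fields())
```

A helper like this only flags incomplete cases before review; it does not replace the contract check.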

For onboarding and spec quality comparisons, use the dedicated benchmark pack in docs/benchmark-pack.md and benchmarks/onboarding-spec/.

Trajectory Dimensions

Recommended dimensions:

  • Intake quality
  • Spec quality
  • Scope control
  • Artifact discipline
  • Verification discipline
  • Handoff quality
  • Efficiency
  • Final usefulness

Scoring

ORCA Framework supports two kinds of scoring:

  • Pass/fail checks for hard requirements
  • Rubric scores for judgment-heavy dimensions

Use pass/fail for things like:

  • a required spec was produced
  • approval was requested before risky work
  • QA mode boundaries were preserved
  • the run recorded a stop reason

Use rubric scoring when grading:

  • question quality
  • clarity of reasoning
  • usefulness of the plan
  • strength of verification choices
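
One way to combine the two kinds of scoring is to let hard checks gate the case and report rubric scores alongside them. The sketch below assumes a numeric rubric scale and uses hypothetical names (`CaseScore`, `score_case`, and the check and dimension keys); the framework does not prescribe this structure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CaseScore:
    passed: bool                 # True only if every hard check passed
    rubric_mean: float | None    # mean rubric score, or None if nothing was graded
    failed_checks: list[str]


def score_case(checks: dict[str, bool], rubric: dict[str, int]) -> CaseScore:
    """Gate on pass/fail checks; average rubric scores separately (assumed scale)."""
    failed = [name for name, ok in checks.items() if not ok]
    mean = sum(rubric.values()) / len(rubric) if rubric else None
    return CaseScore(passed=not failed, rubric_mean=mean, failed_checks=failed)


# Usage: a strong rubric score does not rescue a failed hard requirement.
result = score_case(
    checks={"spec_produced": True, "stop_reason_recorded": False},
    rubric={"question_quality": 4, "plan_usefulness": 3},
)
print(result.passed, result.rubric_mean, result.failed_checks)
```

Keeping the two results separate means a failed hard requirement cannot be averaged away by good rubric scores.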

Eval Sets

Start with small curated eval sets that a maintainer can actually review. The recommended starter set lives at examples/evals/starter-set.md and covers 12 issue-shaped scenarios.

Reporting

Each eval run should produce a human-reviewable report with:

  • scope of the eval run
  • case-by-case outcomes
  • common failure patterns
  • suggested framework improvements
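
The case-by-case outcomes and common failure patterns could be tallied with a short helper before a human writes the report. This is a sketch under assumptions: the outcome dictionaries, the `failure_tags` key, and the `summarize` function are hypothetical and are not taken from templates/eval-report.md.

```python
from __future__ import annotations

from collections import Counter


def summarize(outcomes: list[dict]) -> dict:
    """Count passes and the most frequent failure tags across one eval run."""
    failures = Counter(
        tag for outcome in outcomes for tag in outcome.get("failure_tags", [])
    )
    return {
        "total": len(outcomes),
        "passed": sum(1 for outcome in outcomes if outcome["passed"]),
        "common_failures": failures.most_common(3),
    }


# Usage with three hypothetical case outcomes.
run = [
    {"case": "a", "passed": True},
    {"case": "b", "passed": False, "failure_tags": ["scope_drift", "no_stop_reason"]},
    {"case": "c", "passed": False, "failure_tags": ["scope_drift"]},
]
print(summarize(run))
```

The tally is only an input to the report; suggested framework improvements still need a reviewer's judgment.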

When workflow efficiency matters, pair evals with per-run accounting from docs/workflow-accounting.md. When shared-state, checkpoint, or inspection artifacts exist, use them as additional trajectory evidence rather than ignoring the coordination layer. When portable schemas exist, use them to validate structural consistency before or alongside the judgment-heavy evaluation pass.

Portable eval-adjacent inputs are especially useful when checking:

  • artifact shape drift across versions
  • schema-backed mappings into external eval tools
  • whether a candidate workflow is emitting structurally comparable artifacts
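
A structural pass of this kind can be very small. The sketch below uses plain Python rather than any particular schema format, because the document does not say how the portable schemas are expressed. The function names and the required-key map are hypothetical.

```python
from __future__ import annotations


def structural_errors(artifact: dict, required: dict[str, type]) -> list[str]:
    """List missing keys and wrongly typed values in one artifact."""
    errors = []
    for key, expected_type in required.items():
        if key not in artifact:
            errors.append(f"missing: {key}")
        elif not isinstance(artifact[key], expected_type):
            errors.append(f"wrong type: {key}")
    return errors


def shape_drift(old: dict, new: dict) -> dict[str, list[str]]:
    """Report top-level keys added or removed between two artifact versions."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
    }


# Usage: REQUIRED is an assumed shape, not the framework's schema.
REQUIRED = {"scenario": str, "checks": list}
print(structural_errors({"scenario": "x", "checks": "oops"}, REQUIRED))
print(shape_drift({"scenario": "x", "notes": ""}, {"scenario": "x", "checks": []}))
```

Running a check like this first keeps the judgment-heavy pass focused on artifacts that are at least structurally comparable.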

Relationship To QA

QA checks whether the product behaves correctly. Evals check whether the agent workflow behaved correctly.

Both matter. QA can pass while the workflow still fails the framework standard.