Skip to content

Benchmark Pack

ORCA Framework benchmark packs provide a repeatable way to compare install, onboarding, spec, and orchestration quality over time. The goal is not synthetic leaderboard theater. The goal is to catch whether workflow changes improve or degrade the actual user path.

Purpose

Use benchmark packs to compare:

  • ORCA Framework version to ORCA Framework version
  • prompt set A to prompt set B
  • workflow changes over time
  • ORCA Framework against adjacent frameworks when the comparison is grounded and repeatable

What The Pack Measures

The benchmark packs focus on:

  • install speed and clarity
  • onboarding information yield versus hassle
  • ambiguity detection
  • question quality
  • constraint capture
  • edge-case awareness
  • scope discipline
  • acceptance criteria quality
  • verification readiness
  • overall spec usability

Pack Structure

Scoring Model

Use both:

  • pass or fail checks for hard requirements
  • rubric scoring for quality dimensions

Examples of hard requirements:

  • the install completed without hidden manual recovery
  • the onboarding captured enough information to start real work
  • the workflow identified key ambiguity
  • the workflow separated goals from non-goals
  • the resulting spec contains testable acceptance criteria

Examples of rubric dimensions:

  • whether the install path was understandable to a first-time user
  • whether the onboarding asked too little or too much
  • whether the questions were high leverage
  • whether the scope stayed disciplined
  • whether the resulting spec would help build and QA

Comparison Guidance

When comparing benchmark runs:

  • use the same cases when possible
  • note workflow version, command version, and any prompt changes
  • compare both aggregate outcomes and failure patterns
  • do not treat small score differences as precise science

Evidence Standard

Benchmark reports should link to:

  • the case definition
  • the produced onboarding or spec artifact
  • the scoring notes
  • any known confounders

The benchmark is useful only if a maintainer can inspect why a run passed or failed.