Benchmark Pack

ORCA Framework benchmark packs provide a repeatable way to compare install, onboarding, spec, and orchestration quality over time. The goal is not synthetic leaderboard theater. The goal is to catch whether workflow changes improve or degrade the actual user path.

Purpose

Use benchmark packs to compare:

one ORCA Framework version to another
prompt set A to prompt set B
workflow changes over time
ORCA Framework against adjacent frameworks when the comparison is grounded and repeatable

What The Pack Measures

The benchmark packs focus on:

install speed and clarity
onboarding information yield versus hassle
ambiguity detection
question quality
constraint capture
edge-case awareness
scope discipline
acceptance criteria quality
verification readiness
overall spec usability

Pack Structure

benchmarks/onboarding-spec/README.md
benchmarks/install-onboarding-race/README.md
benchmarks/onboarding-spec/cases/
templates/benchmark-case.md
templates/benchmark-report.md
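
As a minimal sketch, the layout above can be checked before a run so that a missing README or template is caught early. The function name `missing_pack_paths` and the idea of running it from the repository root are assumptions for illustration, not part of the pack itself.

```python
from pathlib import Path

# Paths taken from the pack structure listed above.
EXPECTED_PATHS = (
    "benchmarks/onboarding-spec/README.md",
    "benchmarks/install-onboarding-race/README.md",
    "benchmarks/onboarding-spec/cases",
    "templates/benchmark-case.md",
    "templates/benchmark-report.md",
)


def missing_pack_paths(repo_root):
    """Return the expected pack paths that do not exist under repo_root.

    Hypothetical helper: it only checks existence, not file contents.
    """
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

An empty list means every listed file and directory is present.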

Scoring Model

Use both:

pass or fail checks for hard requirements
rubric scoring for quality dimensions

Examples of hard requirements:

the install completed without hidden manual recovery
the onboarding captured enough information to start real work
the workflow identified key ambiguity
the workflow separated goals from non-goals
the resulting spec contains testable acceptance criteria

Examples of rubric dimensions:

whether the install path was understandable to a first-time user
whether the onboarding asked too little or too much
whether the questions were high leverage
whether the scope stayed disciplined
whether the resulting spec would help with both building and QA
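
One way to keep the two kinds of score separate is sketched below. The `CaseScore` structure, the check names, and the 1 to 5 rubric scale are all assumptions made for illustration; the pack does not prescribe a data format. The point is that a hard-requirement failure fails the case regardless of how well the rubric dimensions score.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class CaseScore:
    """Scoring notes for one benchmark case (hypothetical structure)."""

    case_id: str
    # Hard requirements: check name -> True (pass) or False (fail).
    hard_checks: dict = field(default_factory=dict)
    # Rubric dimensions: dimension name -> score on an assumed 1-5 scale.
    rubric: dict = field(default_factory=dict)

    def passed(self) -> bool:
        # A case passes only if every recorded hard requirement passes.
        return all(self.hard_checks.values())

    def failed_checks(self) -> list:
        # Names of the hard requirements that failed, for the scoring notes.
        return sorted(name for name, ok in self.hard_checks.items() if not ok)

    def rubric_mean(self) -> Optional[float]:
        # Reported separately; it never overrides a hard failure.
        return mean(self.rubric.values()) if self.rubric else None


# Example with made-up check names and scores.
example = CaseScore(
    case_id="example-case",
    hard_checks={
        "install_without_hidden_manual_recovery": True,
        "testable_acceptance_criteria": False,
    },
    rubric={"install_clarity": 4, "question_leverage": 3, "scope_discipline": 5},
)
```

Here `example.passed()` is false because one hard requirement failed, even though the rubric scores are reasonable, and `example.failed_checks()` names the requirement to inspect.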

Comparison Guidance

When comparing benchmark runs:

use the same cases when possible
note workflow version, command version, and any prompt changes
compare both aggregate outcomes and failure patterns
do not treat small score differences as precise science
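
The guidance above can be sketched as a small comparison helper. Everything here is an assumption for illustration: the `Run` fields, the `compare_runs` name, and especially the `noise_margin` default of 0.5, which is an arbitrary placeholder rather than a calibrated threshold. The sketch compares only shared cases, treats small deltas as noise, and tallies failed hard requirements so failure patterns are visible alongside score changes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark run (hypothetical structure)."""

    workflow_version: str
    command_version: str
    prompt_changes: str
    scores: dict          # case_id -> rubric score
    failed_checks: dict   # case_id -> list of failed hard requirements


def compare_runs(baseline, candidate, noise_margin=0.5):
    """Compare two runs on the cases they share."""
    shared = sorted(set(baseline.scores) & set(candidate.scores))
    result = {"improved": [], "regressed": [], "within_noise": []}
    for case_id in shared:
        delta = candidate.scores[case_id] - baseline.scores[case_id]
        if abs(delta) < noise_margin:
            # Small differences are not treated as meaningful.
            result["within_noise"].append(case_id)
        elif delta > 0:
            result["improved"].append(case_id)
        else:
            result["regressed"].append(case_id)

    def failure_pattern(run):
        # Count how often each hard requirement failed across shared cases.
        return Counter(
            check
            for case_id in shared
            for check in run.failed_checks.get(case_id, [])
        )

    # Cases present in only one run are listed, not scored.
    result["unmatched_cases"] = sorted(set(baseline.scores) ^ set(candidate.scores))
    result["baseline_failures"] = failure_pattern(baseline)
    result["candidate_failures"] = failure_pattern(candidate)
    return result
```

Keeping the version and prompt fields on each `Run` means a report built from the result can state what changed between the two runs.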

Evidence Standard

Benchmark reports should link to:

the case definition
the produced onboarding or spec artifact
the scoring notes
any known confounders

The benchmark is useful only if a maintainer can inspect why a run passed or failed.
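
A report can be checked against this standard mechanically. The sketch below assumes a report is available as a dictionary with the hypothetical keys `case_definition`, `artifact`, `scoring_notes`, and `known_confounders`; the actual report template may name or store these differently.

```python
# Fields that must hold a non-empty link or path (assumed key names).
REQUIRED_LINKS = ("case_definition", "artifact", "scoring_notes")


def missing_evidence(report):
    """Return the evidence fields a report fails to provide."""
    missing = [
        key for key in REQUIRED_LINKS
        if not str(report.get(key) or "").strip()
    ]
    # Confounders may legitimately be an empty list, but the field must be
    # stated explicitly so a reader knows it was considered.
    if "known_confounders" not in report:
        missing.append("known_confounders")
    return missing
```

A report with an empty result gives a maintainer every link needed to inspect why the run passed or failed.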