Chat-Agent Evals

This folder contains a dataset-driven eval harness for web chat-agent behavior.

What is graded

  • output: final assistant text quality checks (mustContain, mustNotContain)
  • trajectory: tool-call ordering checks (strict, subset, superset, unordered)
  • trace: event-stream integrity checks (required event types, monotonic globalSeq, and tool-call lifecycle completeness)
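The three graders above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code. The function names, the TraceEvent shape, and the "tool_call_start" / "tool_call_end" event types are hypothetical. The meanings given to the four trajectory modes, and the reading of "monotonic" as strictly increasing, are assumptions too.

```typescript
// Hypothetical types: the real dataset schema and event names may differ.
type TrajectoryMode = "strict" | "subset" | "superset" | "unordered";

interface TraceEvent {
  type: string;
  globalSeq: number;
  toolCallId?: string;
}

// output: every mustContain string is present, no mustNotContain string is.
function checkOutput(text: string, mustContain: string[], mustNotContain: string[]): boolean {
  return mustContain.every((s) => text.includes(s)) &&
    mustNotContain.every((s) => !text.includes(s));
}

// True when every name in `inner` can be matched to a distinct name in `outer`.
function containedIn(inner: string[], outer: string[]): boolean {
  const remaining = outer.slice();
  for (const name of inner) {
    const idx = remaining.indexOf(name);
    if (idx === -1) return false;
    remaining.splice(idx, 1); // consume the match so duplicates are respected
  }
  return true;
}

// trajectory: compare actual tool-call names to the expected ones.
// Assumed semantics: strict = same calls, same order; unordered = same calls,
// any order; subset = actual within expected; superset = expected within actual.
function checkTrajectory(actual: string[], expected: string[], mode: TrajectoryMode): boolean {
  switch (mode) {
    case "strict":
      return actual.length === expected.length && actual.every((n, i) => n === expected[i]);
    case "unordered":
      return actual.length === expected.length && containedIn(actual, expected);
    case "subset":
      return containedIn(actual, expected);
    case "superset":
      return containedIn(expected, actual);
  }
}

// trace: returns a list of problems; an empty list means the trace passed.
function checkTrace(events: TraceEvent[], requiredTypes: string[]): string[] {
  const problems: string[] = [];

  // Required event types must each appear at least once.
  const seen = new Set(events.map((e) => e.type));
  for (const t of requiredTypes) {
    if (!seen.has(t)) problems.push(`missing event type: ${t}`);
  }

  // globalSeq must increase from one event to the next (assumed strict).
  for (let i = 1; i < events.length; i++) {
    if (events[i].globalSeq <= events[i - 1].globalSeq) {
      problems.push(`globalSeq not increasing at index ${i}`);
    }
  }

  // Every started tool call must also end.
  const open = new Set<string>();
  for (const e of events) {
    if (e.toolCallId === undefined) continue;
    if (e.type === "tool_call_start") open.add(e.toolCallId);
    if (e.type === "tool_call_end") open.delete(e.toolCallId);
  }
  open.forEach((id) => problems.push(`tool call never completed: ${id}`));

  return problems;
}
```

Under these assumptions, checkTrajectory(["fetch", "search"], ["search", "fetch"], "unordered") passes while the same inputs fail in "strict" mode.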

Run

pnpm test:evals

This mode is informational: it prints a full summary and does not fail the run when eval cases fail.

Enforce critical checks

pnpm test:evals:enforce

When EVALS_ENFORCE=1 is set, the runner exits with a non-zero status if any critical eval case fails.
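
The enforcement rule can be pictured as a small pure function. This is a sketch under assumptions: CaseResult and exitCodeFor are hypothetical names, and the real runner may track criticality and results differently. Only the EVALS_ENFORCE variable and the "non-zero on critical failure" behavior come from this README.

```typescript
// Hypothetical result shape: the real runner's records may differ.
interface CaseResult {
  id: string;
  critical: boolean;
  passed: boolean;
}

// Returns the exit code the runner would use for a finished eval run.
// Non-critical failures never affect the exit code in this sketch.
function exitCodeFor(
  results: CaseResult[],
  env: Record<string, string | undefined>,
): number {
  const enforce = env.EVALS_ENFORCE === "1";
  const criticalFailures = results.filter((r) => r.critical && !r.passed);
  return enforce && criticalFailures.length > 0 ? 1 : 0;
}
```

In a Node.js runner this could be wired up as process.exitCode = exitCodeFor(results, process.env), so the summary still prints in full before the process exits.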