Chat-Agent Evals
This folder contains a dataset-driven eval harness for web chat-agent behavior.
What is graded
- output: final assistant text quality checks (mustContain, mustNotContain)
- trajectory: tool-call ordering checks (strict, subset, superset, unordered)
- trace: event-stream integrity checks (required event types, monotonic globalSeq, and tool-call lifecycle completeness)
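To make the three graders concrete, here is a minimal sketch of how such checks could be written. It is not the harness's actual code. The function names (`checkOutput`, `checkTrajectory`, `isStrictlyIncreasing`) are hypothetical, and the semantics of each trajectory mode are an assumption: `strict` requires the same tools in the same order, `unordered` the same tools in any order, `subset` allows the agent to call only some of the expected tools, and `superset` allows extra calls beyond the expected ones. "Monotonic" is assumed here to mean strictly increasing.

```typescript
// Hypothetical sketch; names and mode semantics are assumptions, not the harness API.
type TrajectoryMode = "strict" | "subset" | "superset" | "unordered";

// output: every required phrase is present and no forbidden phrase appears.
function checkOutput(text: string, mustContain: string[], mustNotContain: string[]): boolean {
  return (
    mustContain.every((s) => text.indexOf(s) !== -1) &&
    mustNotContain.every((s) => text.indexOf(s) === -1)
  );
}

// True when every name in `a` can be matched to a distinct entry in `b`.
function isSubMultiset(a: string[], b: string[]): boolean {
  const pool = b.slice();
  for (const name of a) {
    const i = pool.indexOf(name);
    if (i === -1) return false;
    pool.splice(i, 1); // consume the match so duplicates are counted
  }
  return true;
}

// trajectory: compare the tool names the agent called against the expected ones.
function checkTrajectory(actual: string[], expected: string[], mode: TrajectoryMode): boolean {
  switch (mode) {
    case "strict":
      return actual.length === expected.length && actual.every((n, i) => n === expected[i]);
    case "unordered":
      return actual.length === expected.length && isSubMultiset(actual, expected);
    case "subset":
      return isSubMultiset(actual, expected);
    case "superset":
      return isSubMultiset(expected, actual);
  }
}

// trace: globalSeq values must increase from one event to the next.
function isStrictlyIncreasing(seqs: number[]): boolean {
  let prev = -Infinity;
  for (const s of seqs) {
    if (s <= prev) return false;
    prev = s;
  }
  return true;
}
```

Under these assumptions, `checkTrajectory(["fetch", "search"], ["search", "fetch"], "unordered")` passes while the same inputs fail in `strict` mode.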
Run
pnpm test:evals
This mode is informational: it prints a full summary without enforcing critical checks.
Enforce critical checks
pnpm test:evals:enforce
When EVALS_ENFORCE=1, the runner exits non-zero if any critical eval case fails.
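The enforcement rule can be sketched as a small pure function. This is an illustration only: the `EvalResult` shape and the `exitCodeFor` helper are hypothetical, and the real runner may track criticality differently.

```typescript
// Hypothetical result shape; the real runner's types are not shown in this README.
interface EvalResult {
  id: string;
  critical: boolean;
  passed: boolean;
}

// Returns the process exit code: non-zero only when enforcement is on
// and at least one critical case failed.
function exitCodeFor(results: EvalResult[], enforce: boolean): number {
  if (!enforce) return 0;
  return results.some((r) => r.critical && !r.passed) ? 1 : 0;
}

// Possible wiring in a Node.js runner (assumption):
//   process.exitCode = exitCodeFor(results, process.env.EVALS_ENFORCE === "1");
```

Keeping the decision in a pure function makes the rule testable without spawning a process: a failing non-critical case never changes the exit code, and a failing critical case changes it only when enforcement is enabled.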