openclaw/test/evals/README.md

# Chat-Agent Evals

This folder contains a dataset-driven eval harness for web chat-agent behavior.

## What is graded

- `output`: checks on the final assistant text (`mustContain`, `mustNotContain`)
- `trajectory`: tool-call sequence checks, compared under a match mode (`strict`, `subset`, `superset`, `unordered`)
- `trace`: event-stream integrity checks (required event types, monotonic `globalSeq`,
  and tool-call lifecycle completeness)
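
The three graders above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the type shapes, the function names (`checkOutput`, `matchTrajectory`, `checkTrace`), the event type names `tool_call_start` / `tool_call_end`, and the `toolCallId` field are all assumptions. The match-mode semantics are also assumed: `strict` means same calls in the same order, `unordered` means same calls in any order, `subset` means the actual calls are contained in the expected ones, and `superset` means the actual calls contain all expected ones. "Monotonic" is read here as strictly increasing.

```typescript
// Hypothetical shapes: field and event names are assumptions, not the real schema.
type TrajectoryMode = "strict" | "subset" | "superset" | "unordered";

interface TraceEvent {
  type: string;
  globalSeq: number;
  toolCallId?: string;
}

// output: every mustContain string is present and no mustNotContain string is.
function checkOutput(text: string, mustContain: string[], mustNotContain: string[]): boolean {
  return (
    mustContain.every((s) => text.indexOf(s) !== -1) &&
    !mustNotContain.some((s) => text.indexOf(s) !== -1)
  );
}

function countOf(names: string[], name: string): number {
  return names.filter((n) => n === name).length;
}

// True when every name in `inner` occurs at least as often in `outer`.
function containsAll(outer: string[], inner: string[]): boolean {
  return inner.every((name) => countOf(outer, name) >= countOf(inner, name));
}

// trajectory: compare actual tool-call names to expected ones under a mode.
function matchTrajectory(actual: string[], expected: string[], mode: TrajectoryMode): boolean {
  switch (mode) {
    case "strict":
      return actual.length === expected.length && actual.every((n, i) => n === expected[i]);
    case "unordered":
      // Equal length plus containment means the two multisets are equal.
      return actual.length === expected.length && containsAll(actual, expected);
    case "subset":
      return containsAll(expected, actual);
    case "superset":
      return containsAll(actual, expected);
    default:
      return false;
  }
}

// trace: required event types, increasing globalSeq, and closed tool calls.
function checkTrace(events: TraceEvent[], requiredTypes: string[]): boolean {
  const hasRequired = requiredTypes.every((t) => events.some((e) => e.type === t));
  const monotonic = events.every((e, i) => i === 0 || e.globalSeq > events[i - 1].globalSeq);
  const open = new Set<string>();
  for (const e of events) {
    if (e.toolCallId === undefined) continue;
    if (e.type === "tool_call_start") open.add(e.toolCallId);
    if (e.type === "tool_call_end") {
      if (!open.has(e.toolCallId)) return false; // end without a matching start
      open.delete(e.toolCallId);
    }
  }
  return hasRequired && monotonic && open.size === 0;
}
```

Under these assumptions, `matchTrajectory(["read", "search"], ["search", "read"], "unordered")` passes while the same inputs fail under `strict`.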

## Run

```bash
pnpm test:evals
```

This mode is informational: it runs every eval case and prints a full summary of the results.

## Enforce critical checks

```bash
pnpm test:evals:enforce
```

When `EVALS_ENFORCE=1` is set, the runner exits with a non-zero status if any **critical** eval case fails.
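
The exit-status rule can be sketched as a small pure function. This is an illustration only: the `EvalResult` shape and the `exitCodeFor` name are hypothetical, and the sketch assumes that only the exact value `"1"` enables enforcement.

```typescript
// Hypothetical result shape; the real runner's fields may differ.
interface EvalResult {
  id: string;
  critical: boolean;
  passed: boolean;
}

// Returns the process exit code for a finished run.
// Assumption: enforcement is on only when the flag is exactly "1".
function exitCodeFor(results: EvalResult[], enforceFlag: string | undefined): number {
  const enforce = enforceFlag === "1";
  const criticalFailed = results.some((r) => r.critical && !r.passed);
  return enforce && criticalFailed ? 1 : 0;
}
```

In a Node runner this might be wired up as `process.exitCode = exitCodeFor(results, process.env.EVALS_ENFORCE)`. Non-critical failures would still show up in the summary without changing the exit status.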