# Chat-Agent Evals
This folder contains a dataset-driven eval harness for web chat-agent behavior.
## What is graded
- `output`: final assistant text quality checks (`mustContain`, `mustNotContain`)
- `trajectory`: tool-call ordering checks (`strict`, `subset`, `superset`, `unordered`)
- `trace`: event-stream integrity checks (required event types, monotonic `globalSeq`, and tool-call lifecycle completeness)
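
As a rough illustration of the three graded dimensions, the TypeScript sketch below shows one way such checks could be written. Only `mustContain`, `mustNotContain`, the four mode names, and `globalSeq` come from this README. Everything else is an assumption made for illustration: the type and function names (`checkOutput`, `checkTrajectory`, `checkTrace`), the `tool-start` / `tool-end` event types, the `toolCallId` field, the reading of the four trajectory modes, and treating "monotonic" as strictly increasing. The harness's actual implementation may differ.

```typescript
// Hypothetical shapes: the harness's real dataset schema may differ.
type MatchMode = "strict" | "subset" | "superset" | "unordered";
interface OutputExpectation { mustContain?: string[]; mustNotContain?: string[] }
interface TraceEvent { type: string; globalSeq: number; toolCallId?: string }

// output: every required substring is present and no forbidden one appears.
function checkOutput(text: string, exp: OutputExpectation): boolean {
  const hasAll = (exp.mustContain ?? []).every((s) => text.includes(s));
  const hasNone = (exp.mustNotContain ?? []).every((s) => !text.includes(s));
  return hasAll && hasNone;
}

// True when each tool in `inner` occurs in `outer` at least as many times.
function isSubMultiset(inner: string[], outer: string[]): boolean {
  const remaining = new Map<string, number>();
  for (const t of outer) remaining.set(t, (remaining.get(t) ?? 0) + 1);
  for (const t of inner) {
    const left = remaining.get(t) ?? 0;
    if (left === 0) return false;
    remaining.set(t, left - 1);
  }
  return true;
}

// trajectory: compare actual tool-call names against the expected ones.
// Assumed semantics: strict = same calls, same order; unordered = same calls,
// any order; subset = actual within expected; superset = actual covers expected.
function checkTrajectory(actual: string[], expected: string[], mode: MatchMode): boolean {
  switch (mode) {
    case "strict":
      return actual.length === expected.length && actual.every((t, i) => t === expected[i]);
    case "unordered":
      return actual.length === expected.length && isSubMultiset(actual, expected);
    case "subset":
      return isSubMultiset(actual, expected);
    case "superset":
      return isSubMultiset(expected, actual);
  }
}

// trace: returns a list of problems; an empty list means the trace passed.
function checkTrace(events: TraceEvent[], requiredTypes: string[]): string[] {
  const problems: string[] = [];

  // Required event types must each appear at least once.
  const seen = new Set(events.map((e) => e.type));
  for (const t of requiredTypes) {
    if (!seen.has(t)) problems.push(`missing event type: ${t}`);
  }

  // globalSeq must increase from one event to the next (strictness is assumed).
  let prev = -Infinity;
  events.forEach((e, i) => {
    if (e.globalSeq <= prev) problems.push(`globalSeq not increasing at index ${i}`);
    prev = e.globalSeq;
  });

  // Every started tool call must also end (event type names are hypothetical).
  const open = new Set<string>();
  for (const e of events) {
    if (e.toolCallId === undefined) continue;
    if (e.type === "tool-start") open.add(e.toolCallId);
    if (e.type === "tool-end") open.delete(e.toolCallId);
  }
  open.forEach((id) => problems.push(`tool call never completed: ${id}`));
  return problems;
}
```

Under these assumptions, `checkTrajectory(["search", "fetch"], ["fetch", "search"], "unordered")` passes while the same inputs fail under `"strict"`, which is the kind of distinction the four modes are presumably meant to capture.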
## Run
```bash
pnpm test:evals
```
This mode is informational and prints a full summary.
## Enforce critical checks
```bash
pnpm test:evals:enforce
```
When `EVALS_ENFORCE=1` is set, the runner exits with a non-zero status if any **critical** eval case fails.
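
The gating rule can be sketched as a small pure function. This is a minimal illustration, not the runner's actual code: the `EvalResult` shape, the `critical` and `passed` field names, and the `exitCodeFor` helper are hypothetical, and returning `0` whenever enforcement is off is an assumption based on the default mode being described as informational.

```typescript
// Hypothetical result shape: field names are assumptions for illustration.
interface EvalResult {
  id: string;
  critical: boolean;
  passed: boolean;
}

// Decide the process exit code from eval results and environment variables.
function exitCodeFor(
  results: EvalResult[],
  env: Record<string, string | undefined>,
): number {
  const enforce = env.EVALS_ENFORCE === "1";
  const criticalFailures = results.filter((r) => r.critical && !r.passed);
  // Non-critical failures never affect the exit code in this sketch.
  return enforce && criticalFailures.length > 0 ? 1 : 0;
}
```

In a Node-based runner this could be wired up as `process.exitCode = exitCodeFor(results, process.env)`, which lets the summary finish printing before the process exits.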