# Chat-Agent Evals
This folder contains a dataset-driven eval harness for web chat-agent behavior.
## What is graded
- `output`: final assistant text quality checks (`mustContain`, `mustNotContain`)
- `trajectory`: tool-call ordering checks (`strict`, `subset`, `superset`, `unordered`)
- `trace`: event-stream integrity checks (required event types, monotonic `globalSeq`, and tool-call lifecycle completeness)
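
As a rough illustration of the `output` and `trajectory` checks, here is a minimal sketch. The function names, the `OutputExpectation` shape, and the exact meaning of each mode are assumptions made for this example, not the harness's actual API: `subset` is taken to mean the agent called no tools beyond the expected ones, and `superset` that it called at least the expected ones. Substring matching is assumed to be case-sensitive.

```typescript
// Hypothetical types and helpers; the real harness may define these differently.
interface OutputExpectation {
  mustContain?: string[];
  mustNotContain?: string[];
}

type TrajectoryMode = "strict" | "subset" | "superset" | "unordered";

// Output check: every required substring is present, no forbidden one is.
function checkOutput(text: string, exp: OutputExpectation): boolean {
  const has = (needle: string): boolean => text.indexOf(needle) !== -1;
  return (exp.mustContain ?? []).every(has) && !(exp.mustNotContain ?? []).some(has);
}

// Count how often each tool name occurs in a trajectory.
function countByName(names: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of names) counts[name] = (counts[name] ?? 0) + 1;
  return counts;
}

// True when every tool in `inner` occurs at least as often in `outer`.
function isSubMultiset(inner: string[], outer: string[]): boolean {
  const innerCounts = countByName(inner);
  const outerCounts = countByName(outer);
  return Object.keys(innerCounts).every(
    (name) => (outerCounts[name] ?? 0) >= (innerCounts[name] ?? 0)
  );
}

// Trajectory check over tool names, under the assumed mode semantics.
function matchTrajectory(actual: string[], expected: string[], mode: TrajectoryMode): boolean {
  switch (mode) {
    case "strict": // same tools, same order
      return actual.length === expected.length && actual.every((n, i) => n === expected[i]);
    case "unordered": // same tools, any order
      return actual.length === expected.length && isSubMultiset(actual, expected);
    case "subset": // actual calls are all drawn from expected
      return isSubMultiset(actual, expected);
    case "superset": // actual calls include everything expected
      return isSubMultiset(expected, actual);
  }
}
```

Under these assumptions, `matchTrajectory(["fetch", "search"], ["search", "fetch"], "unordered")` passes while the same pair fails under `"strict"`.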
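
The three `trace` checks listed above can be sketched as follows. The `TraceEvent` shape, the `toolCallId` field, and the `tool_call_start` / `tool_call_end` event type names are hypothetical and chosen only for illustration; only `globalSeq` comes from this README. "Monotonic" is assumed here to mean strictly increasing.

```typescript
// Hypothetical event shape; only `globalSeq` is named in this README.
interface TraceEvent {
  type: string;
  globalSeq: number;
  toolCallId?: string;
}

// Returns a list of human-readable problems; an empty list means the trace passed.
function checkTrace(events: TraceEvent[], requiredTypes: string[]): string[] {
  const problems: string[] = [];

  // 1. Every required event type must appear at least once.
  const seen: Record<string, boolean> = {};
  for (const e of events) seen[e.type] = true;
  for (const t of requiredTypes) {
    if (!seen[t]) problems.push(`missing required event type: ${t}`);
  }

  // 2. globalSeq must strictly increase from one event to the next (assumed).
  let prev = -Infinity;
  events.forEach((e, i) => {
    if (e.globalSeq <= prev) problems.push(`globalSeq not increasing at index ${i}`);
    prev = e.globalSeq;
  });

  // 3. Every started tool call must also end (assumed event type names).
  const open: Record<string, boolean> = {};
  for (const e of events) {
    const id = e.toolCallId;
    if (id === undefined) continue;
    if (e.type === "tool_call_start") open[id] = true;
    if (e.type === "tool_call_end") {
      if (!open[id]) problems.push(`tool call ${id} ended without starting`);
      delete open[id];
    }
  }
  for (const id of Object.keys(open)) problems.push(`tool call ${id} never completed`);

  return problems;
}
```

Returning a list of problems instead of a single boolean keeps the summary output informative when several integrity checks fail in the same trace.
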
## Run
```bash
pnpm test:evals
```
This mode is informational and prints a full summary.
## Enforce critical checks
```bash
pnpm test:evals:enforce
```
When `EVALS_ENFORCE=1` is set, the runner exits with a non-zero status if any **critical** eval case fails.
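
The exit-code rule can be sketched as a small pure function. The `EvalResult` shape and the `exitCodeFor` name are hypothetical, and the sketch assumes that the informational mode always exits with status 0, which this README implies but does not state.

```typescript
// Hypothetical result shape for one eval case.
interface EvalResult {
  id: string;
  critical: boolean;
  passed: boolean;
}

// Decide the process exit code from the results and the enforce flag.
function exitCodeFor(results: EvalResult[], enforce: boolean): number {
  // Informational mode: report only, never fail the run (assumed).
  if (!enforce) return 0;
  // Enforce mode: any failed critical case fails the run.
  const failedCritical = results.filter((r) => r.critical && !r.passed);
  return failedCritical.length > 0 ? 1 : 0;
}

// In a Node.js runner this might be wired up roughly as:
//   process.exitCode = exitCodeFor(results, process.env.EVALS_ENFORCE === "1");
```

Non-critical failures still appear in the summary in this sketch; they just do not affect the exit status.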