Unconditionally unlinking the lock file after LOCK_TIMEOUT_MS is unsafe:
the holder may legitimately still be running (slow disk, large usage file),
so removing its lock breaks mutual exclusion and allows concurrent
read-modify-write cycles to overwrite each other's entries.
Remove the stale-lock-removal path entirely and throw ERR_LOCK_TIMEOUT
instead. Callers already swallow the error via .catch() in the write queue,
so the only effect is that the write is skipped rather than risking data
loss through a race.
Previously, once the retry loop timed out, withFileLock unconditionally
deleted the lock file and called fn() without reacquiring the lock. If multiple
waiters timed out concurrently they would all enter the critical section
together, defeating the serialisation guarantee and allowing concurrent
read-modify-write cycles to overwrite each other's records.
Fix: after unlinking the stale lock, attempt one final O_EXCL open so
that exactly one concurrent waiter wins the lock and the rest receive
ERR_LOCK_TIMEOUT. The unlocked fast-path is removed entirely.
readJsonArray treated any valid JSON that is not an array as [], causing
appendRecord to overwrite the file with only the new entry — silently
deleting all prior data. This is the same data-loss mode the
malformed-JSON fix was trying to prevent.
Fix: throw ERR_UNEXPECTED_TOKEN_LOG_SHAPE when parsed JSON is not an
array so appendRecord aborts and the existing file is preserved.
The in-memory writeQueues Map serialises writes within one Node process
but two concurrent OpenClaw processes sharing the same workspaceDir
(e.g. parallel CLI runs) can still race: both read the same snapshot
before either writes, and the later writer silently overwrites the
earlier entry.
Add withFileLock() — an O_EXCL advisory lock on <file>.lock — to
coordinate across processes. The per-file in-memory queue is kept to
reduce lock contention within the same process. On lock-acquire failure
the helper retries every 50 ms up to a 5 s timeout; on timeout it
removes a potentially stale lock file and makes one final attempt to
prevent permanent blocking after a crash.
pre-commit: guard the sourcing of resolve-node.sh with a file-existence
check so the hook works in test environments that stub only the files
they care about (the integration test creates run-node-tool.sh but not
resolve-node.sh; node is provided via a fake binary in PATH so the
nvm fallback is never needed in that context).
usage-log: replace Math.random() in makeId() with crypto.randomBytes()
to satisfy the temp-path-guard security lint rule that rejects weak
randomness in source files.
readJsonArray previously caught all errors and returned [], so a
malformed token-usage.json (e.g. from an interrupted writeFile) caused
the next recordTokenUsage call to overwrite the file with only the new
entry, permanently erasing all prior records.
Fix: only suppress ENOENT (file not yet created). Any other error
(SyntaxError, EACCES, …) is re-thrown so appendRecord aborts and the
existing file is left intact. The write-queue slot still absorbs the
rejection via .catch() so future writes are not stalled; callers that
need to observe the failure (e.g. attempt.ts) can attach their own
.catch() handler.
taskId was set to params.runId, the same value already stored in the
runId field, giving downstream consumers two identical fields with
different names. Remove taskId from the type and the entry constructor
to avoid confusion.
Fire-and-forget callers (attempt.ts) can trigger two concurrent
recordTokenUsage() calls for the same workspaceDir. The previous
read-modify-write pattern had no locking, so the last writer silently
overwrote the first, losing that run's entry.
Fix: keep a Map<file, Promise<void>> write queue so each write awaits
the previous one. The queue slot is replaced with a no-throw wrapper so
a failed write does not stall future writes.
Added a concurrent-write test (20 parallel calls) that asserts no
record is lost.
The recordTokenUsage function previously only persisted the aggregate tokensUsed
total, discarding the input/output breakdown that was already available via
getUsageTotals(). This meant token-usage.json had no per-record IO split,
making it impossible to analyse input vs output token costs in dashboards.
Changes:
- Add inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens optional
fields to TokenUsageRecord type in usage-log.ts (new file)
- Write these fields (when non-zero) into each usage entry
- Fields are omitted (not null) when unavailable, keeping existing records valid
- Wire up recordTokenUsage() call in attempt.ts after llm_output hook
This is a purely additive change; existing consumers that only read tokensUsed
are unaffected.