appendRecord wrote token-usage.json in place with a direct fs.writeFile
call; a crash or SIGKILL during that write left truncated JSON. Because
readJsonArray now throws on any non-ENOENT error (to prevent silent data
loss) and recordTokenUsage callers swallow the error via .catch(), one
corrupted write permanently disabled all future token logging until the
file was manually repaired.
The in-place-write bug was fixed in 8c162d0ba via a temp-file + atomic
rename approach, but usage-log.ts still carried its own private
withFileLock / isLockStale implementation. That inline lock had two
known bugs that were fixed in plugin-sdk/file-lock.ts but never applied
here:
1. isLockStale treated empty / unparseable lock content as 'not stale'
   — a process that crashed between open('wx') and writeFile(pid)
   left an empty .lock that appeared live forever, blocking all
   future writers until it was manually removed.
2. No inode identity check before unlink: two waiters observing the
same stale lock could both call unlink; the slower one would
delete the faster one's freshly-acquired lock, letting both enter
fn() concurrently and race on the read-modify-write sequence.
Fix: import withFileLock from infra/file-lock.ts (which re-exports the
canonical plugin-sdk implementation) and remove the ~70-line inline lock.
APPEND_LOCK_OPTIONS reproduces the previous timeout/retry budget
(~100 × 50 ms ≈ 5 s) while gaining all fixes from plugin-sdk/file-lock.
The lock payload format changed from a plain PID string to the JSON
{pid, createdAt} envelope expected by the shared implementation; the
stale-lock integration test is updated to match.
TOCTOU in the stale-lock branch: the isStaleLock(lockPath) check and the
eventual unlink are separated by several awaits. If two
waiters (same process or different processes) both observe the same
stale file, waiter A can unlink, create a fresh lock, and start fn(),
then waiter B's delayed unlink removes A's fresh file. B then wins
open(O_EXCL) and both A and B execute fn() concurrently, breaking the
read-modify-write guarantee for token-usage.json.
Fix: snapshot the lock file's inode immediately after the EEXIST, then
re-stat right before the unlink. If the inode changed between the two
stats, a concurrent waiter already reclaimed the stale file and wrote a
fresh lock; leave the new file alone and continue to the next
open(O_EXCL) attempt. The three-outcome table:
- staleIno == -1 (file gone by the time we stat)
  → skip unlink, continue: another waiter already handled it
- staleIno == currentIno (same stale file still there)
  → safe to unlink; we and the other waiter(s) racing here all call
    rm(force:true) — the first succeeds, the rest get silent ENOENT
- staleIno != currentIno (inode changed — fresh lock in place)
  → do NOT unlink; continue and let isStaleLock reject the live lock
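A minimal sketch of that re-check, using a hypothetical helper
tryReclaimStaleLock that receives the inode snapshotted right after the
EEXIST (the helper name and return convention are assumptions, not the
actual implementation):

```typescript
import * as fs from "node:fs/promises";

// Returns true when the caller should retry open("wx") immediately
// (we removed the stale lock, or it was already gone), false when a
// concurrent waiter has already installed a fresh lock.
async function tryReclaimStaleLock(
  lockPath: string,
  staleIno: bigint, // inode snapshotted right after the EEXIST
): Promise<boolean> {
  let currentIno: bigint;
  try {
    currentIno = (await fs.stat(lockPath, { bigint: true })).ino;
  } catch {
    return true; // file already gone: another waiter handled it
  }
  if (currentIno !== staleIno) {
    return false; // inode changed: fresh lock in place — do NOT unlink it
  }
  // Same stale inode: safe to remove. Waiters racing here all use
  // force, so the losers' ENOENT is silent.
  await fs.rm(lockPath, { force: true });
  return true;
}
```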
An in-loop HELD_LOCKS re-check was considered and rejected: joining
the existing holder inside the retry loop would allow two independent
concurrent callers to run fn() simultaneously, which breaks mutual
exclusion. HELD_LOCKS reentrant join is intentionally restricted to the
entry point of acquireFileLock (recursive/reentrant callers only).
Tests added:
- two concurrent waiters on a stale lock never overlap inside fn()
(maxInside assertion, not just result set)
- existing stale-reclaim tests continue to pass
The lock file is created (empty) by open("wx") before pid/createdAt
are written by the subsequent writeFile. A process that crashes in this
narrow window leaves an empty .lock file whose content readLockPayload()
cannot parse (returns null).
Previously isStaleLock skipped both the pid-alive and the age checks
when payload was null, falling through to the mtime stat. If the mtime
was still within staleMs the function returned false, making the empty
lock appear live indefinitely — every future writer would time out and
silently drop its usage record until the file was manually deleted.
Fix: treat null payload (empty, truncated, or non-JSON content) as
stale immediately. Such a file could only have been left by a process
that never completed the write, so it is safe to reclaim without
waiting for the mtime timeout.
The mtime stat fallback is also removed: its only useful case was
exactly this null-payload scenario (it was redundant when payload is
valid, since the pid-alive and createdAt-age checks already cover the
live-lock and aged-out-lock cases).
Tests added:
- empty lock file → reclaimed, callback runs
- truncated/invalid JSON lock file → reclaimed
- pid field not a number → reclaimed
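The reworked stale check can be sketched as follows; the payload shape
matches the {pid, createdAt} envelope above, but the function signatures
and exact staleness semantics are assumptions, not the plugin-sdk code:

```typescript
type LockPayload = { pid: number; createdAt: number };

function readLockPayload(raw: string): LockPayload | null {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed?.pid !== "number" || typeof parsed?.createdAt !== "number") {
      return null; // wrong shape, e.g. pid field not a number
    }
    return parsed;
  } catch {
    return null; // empty, truncated, or non-JSON content
  }
}

function isStaleLock(raw: string, staleMs: number, now = Date.now()): boolean {
  const payload = readLockPayload(raw);
  // null payload: the writer crashed between open("wx") and writeFile,
  // so the holder can never release it — reclaim immediately.
  if (payload === null) return true;
  try {
    process.kill(payload.pid, 0); // signal 0: liveness probe only
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ESRCH") return true; // holder gone
  }
  return now - payload.createdAt > staleMs; // aged-out lock
}
```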
appendRecord previously called fs.writeFile(token-usage.json, …) directly.
A process crash or SIGKILL during that write can leave the file truncated;
readJsonArray then throws (SyntaxError), and since attempt.ts swallows the
error with .catch(), that one interrupted write silently disables all future
token logging for the workspace until the file is manually repaired.
Fix: write the new content to a uniquely-named sibling temp file first, then
call fs.rename() to atomically replace the real file. rename(2) is atomic on
POSIX when src and dst share the same directory/filesystem, so readers always
see either the old complete file or the new complete file — never a partial
write. The temp file is unlinked on error to avoid leaving orphans.
A process killed or crashed after creating token-usage.json.lock but
before the finally-unlink runs leaves a permanent stale lock. All
subsequent recordTokenUsage calls for that workspace time out and drop
their entries.
Fix:
- Write the holder's PID into the lock file on acquisition (O_EXCL + writeFile).
- On each EEXIST retry, call isLockStale() which reads the PID and sends
signal 0 (kill(pid, 0)) to check liveness without delivering a signal.
ESRCH means the process is gone → lock is stale; any other result
(alive, EPERM, unreadable file) is treated as live so we never break a
legitimately held lock.
- If stale, unlink and continue to the next O_EXCL attempt; multiple
concurrent waiters racing on the steal are safe because only one O_EXCL
open succeeds.
- Recovery is immediate (no need to wait for LOCK_TIMEOUT_MS).
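The signal-0 probe at the heart of that check can be sketched as (helper
name assumed):

```typescript
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check, nothing is delivered
    return true;
  } catch (err) {
    // ESRCH → process gone → lock is stale. Any other failure (e.g.
    // EPERM for a process we may not signal) is treated as live so a
    // legitimately held lock is never broken.
    return (err as NodeJS.ErrnoException).code !== "ESRCH";
  }
}
```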
Add a test that spawns a subprocess, waits for it to exit, writes its
dead PID into the lock file, and asserts recordTokenUsage succeeds and
cleans up the lock.
Unconditionally unlinking the lock file after LOCK_TIMEOUT_MS is unsafe:
the holder may legitimately still be running (slow disk, large usage file),
so removing its lock breaks mutual exclusion and allows concurrent
read-modify-write cycles to overwrite each other's entries.
Remove the stale-lock-removal path entirely and throw ERR_LOCK_TIMEOUT
instead. Callers already swallow the error via .catch() in the write queue,
so the only effect is that the write is skipped rather than risking data
loss through a race.
After the retry loop timed out, withFileLock unconditionally deleted the
lock file and called fn() without reacquiring the lock. If multiple
waiters timed out concurrently they would all enter the critical section
together, defeating the serialisation guarantee and allowing concurrent
read-modify-write cycles to overwrite each other's records.
Fix: after unlinking the stale lock, attempt one final O_EXCL open so
that exactly one concurrent waiter wins the lock and the rest receive
ERR_LOCK_TIMEOUT. The unlocked fast-path is removed entirely.
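The resulting loop shape, sketched with illustrative names and the
timeout/retry values mentioned earlier (this is a simplification, not the
actual withFileLock body):

```typescript
import * as fs from "node:fs/promises";

const LOCK_RETRY_MS = 50;
const LOCK_TIMEOUT_MS = 5_000;

async function tryOpenLock(lockPath: string): Promise<boolean> {
  try {
    const handle = await fs.open(lockPath, "wx"); // O_CREAT | O_EXCL
    await handle.writeFile(String(process.pid));
    await handle.close();
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err;
    return false;
  }
}

// Returns true if the lock was acquired; callers throw ERR_LOCK_TIMEOUT
// on false instead of entering the critical section unlocked.
async function acquireLock(lockPath: string): Promise<boolean> {
  const deadline = Date.now() + LOCK_TIMEOUT_MS;
  while (Date.now() < deadline) {
    if (await tryOpenLock(lockPath)) return true;
    await new Promise((r) => setTimeout(r, LOCK_RETRY_MS));
  }
  // Timed out: remove the presumed-stale lock, then make one final
  // O_EXCL attempt so exactly one concurrent waiter can win.
  await fs.rm(lockPath, { force: true });
  return tryOpenLock(lockPath);
}
```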
readJsonArray treated any valid JSON that is not an array as [], causing
appendRecord to overwrite the file with only the new entry — silently
deleting all prior data. This is the same data-loss mode the
malformed-JSON fix was trying to prevent.
Fix: throw ERR_UNEXPECTED_TOKEN_LOG_SHAPE when parsed JSON is not an
array so appendRecord aborts and the existing file is preserved.
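A minimal sketch of that shape guard (the error code comes from the notes
above; the class and function names are assumptions):

```typescript
class UnexpectedTokenLogShapeError extends Error {
  readonly code = "ERR_UNEXPECTED_TOKEN_LOG_SHAPE";
}

function ensureTokenLogArray(parsed: unknown): unknown[] {
  if (!Array.isArray(parsed)) {
    // Valid JSON of the wrong shape: returning [] here would let
    // appendRecord overwrite the file with a single entry, silently
    // erasing all prior records.
    throw new UnexpectedTokenLogShapeError("token usage log is not a JSON array");
  }
  return parsed;
}
```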
The in-memory writeQueues Map serialises writes within one Node process
but two concurrent OpenClaw processes sharing the same workspaceDir
(e.g. parallel CLI runs) can still race: both read the same snapshot
before either writes, and the later writer silently overwrites the
earlier entry.
Add withFileLock() — an O_EXCL advisory lock on <file>.lock — to
coordinate across processes. The per-file in-memory queue is kept to
reduce lock contention within the same process. On lock-acquire failure
the helper retries every 50 ms up to a 5 s timeout; on timeout it
removes a potentially stale lock file and makes one final attempt to
prevent permanent blocking after a crash.
pre-commit: guard the resolve-node.sh source with a file-existence
check so the hook works in test environments that stub only the files
they care about (the integration test creates run-node-tool.sh but not
resolve-node.sh; node is provided via a fake binary in PATH so the
nvm fallback is never needed in that context).
usage-log: replace Math.random() in makeId() with crypto.randomBytes()
to satisfy the temp-path-guard security lint rule that rejects weak
randomness in source files.
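A sketch of the replacement (the byte count is an assumption):

```typescript
import { randomBytes } from "node:crypto";

// CSPRNG-backed id instead of Math.random(): 8 random bytes → 16 hex chars.
function makeId(): string {
  return randomBytes(8).toString("hex");
}
```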
readJsonArray previously caught all errors and returned [], so a
malformed token-usage.json (e.g. from an interrupted writeFile) caused
the next recordTokenUsage call to overwrite the file with only the new
entry, permanently erasing all prior records.
Fix: only suppress ENOENT (file not yet created). Any other error
(SyntaxError, EACCES, …) is re-thrown so appendRecord aborts and the
existing file is left intact. The write-queue slot still absorbs the
rejection via .catch() so future writes are not stalled; callers that
need to observe the failure (e.g. attempt.ts) can attach their own
.catch() handler.
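A sketch of readJsonArray with the ENOENT-only suppression (illustrative,
not the actual source):

```typescript
import * as fs from "node:fs/promises";

async function readJsonArray(filePath: string): Promise<unknown[]> {
  let raw: string;
  try {
    raw = await fs.readFile(filePath, "utf8");
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") {
      return []; // file not yet created: an empty log is the right answer
    }
    throw err; // EACCES etc.: abort so the existing file stays intact
  }
  return JSON.parse(raw); // SyntaxError propagates on corrupted content
}
```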
taskId was set to params.runId, the same value already stored in the
runId field, giving downstream consumers two identical fields with
different names. Remove taskId from the type and the entry constructor
to avoid confusion.
Fire-and-forget callers (attempt.ts) can trigger two concurrent
recordTokenUsage() calls for the same workspaceDir. The previous
read-modify-write pattern had no locking, so the last writer silently
overwrote the first, losing that run's entry.
Fix: keep a Map<file, Promise<void>> write queue so each write awaits
the previous one. The queue slot is replaced with a no-throw wrapper so
a failed write does not stall future writes.
Added a concurrent-write test (20 parallel calls) that asserts no
record is lost.
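The queue pattern can be sketched as (names assumed):

```typescript
const writeQueues = new Map<string, Promise<void>>();

function enqueueWrite(file: string, write: () => Promise<void>): Promise<void> {
  const prev = writeQueues.get(file) ?? Promise.resolve();
  const next = prev.then(write);
  // Store a no-throw wrapper so one failed write never stalls the queue;
  // callers that need the failure still observe it on the returned promise.
  writeQueues.set(file, next.catch(() => {}));
  return next;
}
```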
The recordTokenUsage function previously only persisted the aggregate tokensUsed
total, discarding the input/output breakdown that was already available via
getUsageTotals(). This meant token-usage.json had no per-record IO split,
making it impossible to analyse input vs output token costs in dashboards.
Changes:
- Add inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens optional
fields to TokenUsageRecord type in usage-log.ts (new file)
- Write these fields (when non-zero) into each usage entry
- Fields are omitted (not null) when unavailable, keeping existing records valid
- Wire up recordTokenUsage() call in attempt.ts after llm_output hook
This is a purely additive change; existing consumers that only read tokensUsed
are unaffected.
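The record shape and omit-when-zero behaviour can be sketched as follows
(field names come from the notes above; the constructor is hypothetical):

```typescript
type TokenUsageRecord = {
  runId: string;
  tokensUsed: number;
  // optional IO breakdown — omitted (not null) when unavailable
  inputTokens?: number;
  outputTokens?: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
};

type IoBreakdown = Pick<
  TokenUsageRecord,
  "inputTokens" | "outputTokens" | "cacheReadTokens" | "cacheWriteTokens"
>;

function buildRecord(runId: string, tokensUsed: number, io: IoBreakdown): TokenUsageRecord {
  const record: TokenUsageRecord = { runId, tokensUsed };
  // write the breakdown fields only when non-zero, so existing records
  // and consumers that only read tokensUsed are unaffected
  for (const key of Object.keys(io) as (keyof IoBreakdown)[]) {
    if (io[key]) record[key] = io[key];
  }
  return record;
}
```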
* fix(telegram): preserve media download transport policy
* refactor(telegram): thread media transport policy
* fix(telegram): sync fallback media policy
* fix: note telegram media transport fix (#44639)
Process messageData via handleDeltaEvent for both delta and final states
before resolving the turn, so ACP clients no longer drop the last visible
assistant text when the gateway sends the final message body on the
terminal chat event.
Closes #15377
Based on #17615
Co-authored-by: PJ Eby <3527052+pjeby@users.noreply.github.com>