15 Commits

Author SHA1 Message Date
jiarung
d2b7b46604 fix(usage-log): increase lock stale window to 30 s to prevent active-lock steals
appendRecord rewrites the full token-usage.json on every write, so lock
hold time grows with file size and disk speed.  The previous stale: 5_000
was too tight: on large histories or slow disks a write can legitimately
take longer than 5 s, allowing a concurrent waiter to treat the still-
active lock as stale, reclaim it, and run an overlapping read-modify-write
cycle that silently drops the earlier writer's entry.

The risk is amplified by the attempt path where recordTokenUsage is fired
without awaiting, so multiple concurrent runs can legitimately overlap.

Fix:
• Raise stale to 30_000 ms — gives ample headroom for large files on
  slow disks while still reclaiming crashed-process locks within 30 s.
• Match the retry budget: 150 retries × 200 ms ≈ 30 s with jitter,
  so waiters exhaust retries only when the holder exceeds the stale
  window (i.e., is genuinely stuck or has crashed).
2026-03-16 03:07:58 +00:00
jiarung
9f05b36834 fix(usage-log): canonicalize queue key to prevent concurrent writes via path aliases
writeQueues was keyed by the raw workspaceDir-derived path before any
realpath resolution.  Two callers using different spellings of the same
physical directory (a symlink and its target, or a relative vs absolute
path) therefore produced separate queue entries and both entered
appendRecord concurrently.

Inside appendRecord, withFileLock calls resolveNormalizedFilePath which
uses fs.realpath on the directory; both spellings resolve to the same
normalised path.  If one chain is already in fn() — its entry set in
HELD_LOCKS — the second chain's acquireFileLock sees HELD_LOCKS hit for
the same normalised path and re-entrantly joins it.  Both callbacks then
execute the read-modify-write cycle concurrently, and whichever writes
last overwrites the first, silently dropping one entry per collision.

Fix: call fs.realpath(memoryDir) immediately after fs.mkdir and use the
canonical path as both the writeQueues key and the appendRecord file
argument.  A single canonical key means all in-process writers for the
same physical file are serialised through one queue regardless of how
the workspace path was spelled by the caller.

Test: symlink tmpDir to a second name and interleave concurrent
recordTokenUsage calls across both spellings.  Asserts all N records
survive — regression guard for the path-alias queue split.
2026-03-15 08:07:32 +00:00
jiarung
3e1eda63d9 refactor(usage-log): delegate cross-process lock to plugin-sdk/file-lock
appendRecord wrote token-usage.json in place with a direct fs.writeFile
call; a crash or SIGKILL during that write left truncated JSON.  Because
readJsonArray now throws on any non-ENOENT error (to prevent silent data
loss) and recordTokenUsage callers swallow the error via .catch(), one
corrupted write permanently disabled all future token logging until the
file was manually repaired.

The in-place-write bug was fixed in 8c162d0ba via a temp-file + atomic
rename approach, but usage-log.ts still carried its own private
withFileLock / isLockStale implementation.  That inline lock had two
known bugs that were fixed in plugin-sdk/file-lock.ts but never applied
here:

  1. isLockStale treated empty / unparseable lock content as 'not stale'
     — a process that crashes between open('wx') and writeFile(pid)
     leaves an empty .lock that appeared live forever, blocking all
     future writers until it was manually removed.

  2. No inode identity check before unlink: two waiters observing the
     same stale lock could both call unlink; the slower one would
     delete the faster one's freshly-acquired lock, letting both enter
     fn() concurrently and race on the read-modify-write sequence.

Fix: import withFileLock from infra/file-lock.ts (which re-exports the
canonical plugin-sdk implementation) and remove the ~70-line inline lock.
APPEND_LOCK_OPTIONS reproduces the previous timeout/retry budget
(~100 × 50 ms ≈ 5 s) while gaining all fixes from plugin-sdk/file-lock.

The lock payload format changed from a plain PID string to the JSON
{pid, createdAt} envelope expected by the shared implementation; the
stale-lock integration test is updated to match.
2026-03-15 07:36:31 +00:00
jiarung
8c162d0ba4 fix(usage-log): write via temp file and atomic rename to prevent corruption
appendRecord previously called fs.writeFile(token-usage.json, …) directly.
A process crash or SIGKILL during that write can leave the file truncated;
readJsonArray then throws (SyntaxError), and since attempt.ts swallows the
error with .catch(), that one interrupted write silently disables all future
token logging for the workspace until the file is manually repaired.

Fix: write the new content to a uniquely-named sibling temp file first, then
call fs.rename() to atomically replace the real file. rename(2) is atomic on
POSIX when src and dst share the same directory/filesystem, so readers always
see either the old complete file or the new complete file — never a partial
write. The temp file is unlinked on error to avoid leaving orphans.
2026-03-14 18:41:39 +00:00
jiarung
1a5489bf32 fix(usage-log): recover stale lock left by abnormal process exit
A process killed or crashed after creating token-usage.json.lock but
before the finally-unlink runs leaves a permanent stale lock. All
subsequent recordTokenUsage calls for that workspace time out and drop
their entries.

Fix:
- Write the holder's PID into the lock file on acquisition (O_EXCL + writeFile).
- On each EEXIST retry, call isLockStale() which reads the PID and sends
  signal 0 (kill(pid, 0)) to check liveness without delivering a signal.
  ESRCH means the process is gone → lock is stale; any other result
  (alive, EPERM, unreadable file) is treated as live so we never break a
  legitimately held lock.
- If stale, unlink and continue to the next O_EXCL attempt; multiple
  concurrent waiters racing on the steal are safe because only one O_EXCL
  open succeeds.
- Recovery is immediate (no need to wait for LOCK_TIMEOUT_MS).

Add a test that spawns a subprocess, waits for it to exit, writes its
dead PID into the lock file, and asserts recordTokenUsage succeeds and
cleans up the lock.
2026-03-14 16:48:08 +00:00
jiarung
00e05fb4e5 fix(usage-log): do not delete lock on timeout — holder may still be active
Unconditionally unlinking the lock file after LOCK_TIMEOUT_MS is unsafe:
the holder may legitimately still be running (slow disk, large usage file),
so removing its lock breaks mutual exclusion and allows concurrent
read-modify-write cycles to overwrite each other's entries.

Remove the stale-lock-removal path entirely and throw ERR_LOCK_TIMEOUT
instead. Callers already swallow the error via .catch() in the write queue,
so the only effect is that the write is skipped rather than risking data
loss through a race.
2026-03-14 16:35:39 +00:00
jiarung
13b0c1d010 fix(usage-log): reacquire lock via O_EXCL after timeout instead of running unlocked
After the retry loop timed out, withFileLock unconditionally deleted the
lock file and called fn() without reacquiring the lock. If multiple
waiters timed out concurrently they would all enter the critical section
together, defeating the serialisation guarantee and allowing concurrent
read-modify-write cycles to overwrite each other's records.

Fix: after unlinking the stale lock, attempt one final O_EXCL open so
that exactly one concurrent waiter wins the lock and the rest receive
ERR_LOCK_TIMEOUT. The unlocked fast-path is removed entirely.
2026-03-13 23:41:25 +00:00
jiarung
a7a7923d09 fix(usage-log): reject non-array token logs instead of resetting history
readJsonArray treated any valid JSON that is not an array as [], causing
appendRecord to overwrite the file with only the new entry — silently
deleting all prior data. This is the same data-loss mode the
malformed-JSON fix was trying to prevent.

Fix: throw ERR_UNEXPECTED_TOKEN_LOG_SHAPE when parsed JSON is not an
array so appendRecord aborts and the existing file is preserved.
2026-03-13 23:35:12 +00:00
jiarung
f267ff7888 fix(usage-log): add cross-process file lock to prevent write races
The in-memory writeQueues Map serialises writes within one Node process
but two concurrent OpenClaw processes sharing the same workspaceDir
(e.g. parallel CLI runs) can still race: both read the same snapshot
before either writes, and the later writer silently overwrites the
earlier entry.

Add withFileLock() — an O_EXCL advisory lock on <file>.lock — to
coordinate across processes. The per-file in-memory queue is kept to
reduce lock contention within the same process. On lock-acquire failure
the helper retries every 50 ms up to a 5 s timeout; on timeout it
removes a potentially stale lock file and makes one final attempt to
prevent permanent blocking after a crash.
2026-03-13 16:04:22 +00:00
jiarung
386dbb010e fix(git-hooks,usage-log): fix two CI failures
pre-commit: guard the resolve-node.sh source with a file-existence
check so the hook works in test environments that stub only the files
they care about (the integration test creates run-node-tool.sh but not
resolve-node.sh; node is provided via a fake binary in PATH so the
nvm fallback is never needed in that context).

usage-log: replace Math.random() in makeId() with crypto.randomBytes()
to satisfy the temp-path-guard security lint rule that rejects weak
randomness in source files.
2026-03-13 14:57:25 +00:00
jiarung
020001d9b2 fix(usage-log): propagate non-ENOENT read errors to prevent silent data loss
readJsonArray previously caught all errors and returned [], so a
malformed token-usage.json (e.g. from an interrupted writeFile) caused
the next recordTokenUsage call to overwrite the file with only the new
entry, permanently erasing all prior records.

Fix: only suppress ENOENT (file not yet created). Any other error
(SyntaxError, EACCES, …) is re-thrown so appendRecord aborts and the
existing file is left intact. The write-queue slot still absorbs the
rejection via .catch() so future writes are not stalled; callers that
need to observe the failure (e.g. attempt.ts) can attach their own
.catch() handler.
2026-03-13 14:25:13 +00:00
jiarung
cece47f490 fix(usage-log): remove redundant taskId field from TokenUsageRecord
taskId was set to params.runId, the same value already stored in the
runId field, giving downstream consumers two identical fields with
different names. Remove taskId from the type and the entry constructor
to avoid confusion.
2026-03-13 09:40:24 +00:00
jiarung
d03e7ae8ed fix(usage-log): serialise concurrent writes with per-file promise queue
Fire-and-forget callers (attempt.ts) can trigger two concurrent
recordTokenUsage() calls for the same workspaceDir. The previous
read-modify-write pattern had no locking, so the last writer silently
overwrote the first, losing that run's entry.

Fix: keep a Map<file, Promise<void>> write queue so each write awaits
the previous one. The queue slot is replaced with a no-throw wrapper so
a failed write does not stall future writes.

Added a concurrent-write test (20 parallel calls) that asserts no
record is lost.
2026-03-13 09:37:25 +00:00
jiarung
b5cf5aa59f fix(usage-log): add curly braces to satisfy oxlint curly rule 2026-03-13 09:00:56 +00:00
jiarung
8cbc05ae1f feat(usage-log): record inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens
The recordTokenUsage function previously only persisted the aggregate tokensUsed
total, discarding the input/output breakdown that was already available via
getUsageTotals(). This meant token-usage.json had no per-record IO split,
making it impossible to analyse input vs output token costs in dashboards.

Changes:
- Add inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens optional
  fields to TokenUsageRecord type in usage-log.ts (new file)
- Write these fields (when non-zero) into each usage entry
- Fields are omitted (not null) when unavailable, keeping existing records valid
- Wire up recordTokenUsage() call in attempt.ts after llm_output hook

This is a purely additive change; existing consumers that only read tokensUsed
are unaffected.
2026-03-13 06:35:38 +00:00