openclaw

Author	SHA1	Message	Date
jiarung	d2b7b46604	fix(usage-log): increase lock stale window to 30 s to prevent active-lock steals appendRecord rewrites the full token-usage.json on every write, so lock hold time grows with file size and disk speed. The previous stale: 5_000 was too tight: on large histories or slow disks a write can legitimately take longer than 5 s, allowing a concurrent waiter to treat the still- active lock as stale, reclaim it, and run an overlapping read-modify-write cycle that silently drops the earlier writer's entry. The risk is amplified by the attempt path where recordTokenUsage is fired without awaiting, so multiple concurrent runs can legitimately overlap. Fix: • Raise stale to 30_000 ms — gives ample headroom for large files on slow disks while still reclaiming crashed-process locks within 30 s. • Match the retry budget: 150 retries × 200 ms ≈ 30 s with jitter, so waiters exhaust retries only when the holder exceeds the stale window (i.e., is genuinely stuck or has crashed).	2026-03-16 03:07:58 +00:00
jiarung	9f05b36834	fix(usage-log): canonicalize queue key to prevent concurrent writes via path aliases writeQueues was keyed by the raw workspaceDir-derived path before any realpath resolution. Two callers using different spellings of the same physical directory (a symlink and its target, or a relative vs absolute path) therefore produced separate queue entries and both entered appendRecord concurrently. Inside appendRecord, withFileLock calls resolveNormalizedFilePath which uses fs.realpath on the directory; both spellings resolve to the same normalised path. If one chain is already in fn() — its entry set in HELD_LOCKS — the second chain's acquireFileLock sees HELD_LOCKS hit for the same normalised path and re-entrantly joins it. Both callbacks then execute the read-modify-write cycle concurrently, and whichever writes last overwrites the first, silently dropping one entry per collision. Fix: call fs.realpath(memoryDir) immediately after fs.mkdir and use the canonical path as both the writeQueues key and the appendRecord file argument. A single canonical key means all in-process writers for the same physical file are serialised through one queue regardless of how the workspace path was spelled by the caller. Test: symlink tmpDir to a second name and interleave concurrent recordTokenUsage calls across both spellings. Asserts all N records survive — regression guard for the path-alias queue split.	2026-03-15 08:07:32 +00:00
jiarung	3e1eda63d9	refactor(usage-log): delegate cross-process lock to plugin-sdk/file-lock appendRecord wrote token-usage.json in place with a direct fs.writeFile call; a crash or SIGKILL during that write left truncated JSON. Because readJsonArray now throws on any non-ENOENT error (to prevent silent data loss) and recordTokenUsage callers swallow the error via .catch(), one corrupted write permanently disabled all future token logging until the file was manually repaired. The in-place-write bug was fixed in 8c162d0ba via a temp-file + atomic rename approach, but usage-log.ts still carried its own private withFileLock / isLockStale implementation. That inline lock had two known bugs that were fixed in plugin-sdk/file-lock.ts but never applied here: 1. isLockStale treated empty / unparseable lock content as 'not stale' — a process that crashes between open('wx') and writeFile(pid) leaves an empty .lock that appeared live forever, blocking all future writers until it was manually removed. 2. No inode identity check before unlink: two waiters observing the same stale lock could both call unlink; the slower one would delete the faster one's freshly-acquired lock, letting both enter fn() concurrently and race on the read-modify-write sequence. Fix: import withFileLock from infra/file-lock.ts (which re-exports the canonical plugin-sdk implementation) and remove the ~70-line inline lock. APPEND_LOCK_OPTIONS reproduces the previous timeout/retry budget (~100 × 50 ms ≈ 5 s) while gaining all fixes from plugin-sdk/file-lock. The lock payload format changed from a plain PID string to the JSON {pid, createdAt} envelope expected by the shared implementation; the stale-lock integration test is updated to match.	2026-03-15 07:36:31 +00:00
jiarung	8c162d0ba4	fix(usage-log): write via temp file and atomic rename to prevent corruption appendRecord previously called fs.writeFile(token-usage.json, …) directly. A process crash or SIGKILL during that write can leave the file truncated; readJsonArray then throws (SyntaxError), and since attempt.ts swallows the error with .catch(), that one interrupted write silently disables all future token logging for the workspace until the file is manually repaired. Fix: write the new content to a uniquely-named sibling temp file first, then call fs.rename() to atomically replace the real file. rename(2) is atomic on POSIX when src and dst share the same directory/filesystem, so readers always see either the old complete file or the new complete file — never a partial write. The temp file is unlinked on error to avoid leaving orphans.	2026-03-14 18:41:39 +00:00
jiarung	1a5489bf32	fix(usage-log): recover stale lock left by abnormal process exit A process killed or crashed after creating token-usage.json.lock but before the finally-unlink runs leaves a permanent stale lock. All subsequent recordTokenUsage calls for that workspace time out and drop their entries. Fix: - Write the holder's PID into the lock file on acquisition (O_EXCL + writeFile). - On each EEXIST retry, call isLockStale() which reads the PID and sends signal 0 (kill(pid, 0)) to check liveness without delivering a signal. ESRCH means the process is gone → lock is stale; any other result (alive, EPERM, unreadable file) is treated as live so we never break a legitimately held lock. - If stale, unlink and continue to the next O_EXCL attempt; multiple concurrent waiters racing on the steal are safe because only one O_EXCL open succeeds. - Recovery is immediate (no need to wait for LOCK_TIMEOUT_MS). Add a test that spawns a subprocess, waits for it to exit, writes its dead PID into the lock file, and asserts recordTokenUsage succeeds and cleans up the lock.	2026-03-14 16:48:08 +00:00
jiarung	00e05fb4e5	fix(usage-log): do not delete lock on timeout — holder may still be active Unconditionally unlinking the lock file after LOCK_TIMEOUT_MS is unsafe: the holder may legitimately still be running (slow disk, large usage file), so removing its lock breaks mutual exclusion and allows concurrent read-modify-write cycles to overwrite each other's entries. Remove the stale-lock-removal path entirely and throw ERR_LOCK_TIMEOUT instead. Callers already swallow the error via .catch() in the write queue, so the only effect is that the write is skipped rather than risking data loss through a race.	2026-03-14 16:35:39 +00:00
jiarung	13b0c1d010	fix(usage-log): reacquire lock via O_EXCL after timeout instead of running unlocked After the retry loop timed out, withFileLock unconditionally deleted the lock file and called fn() without reacquiring the lock. If multiple waiters timed out concurrently they would all enter the critical section together, defeating the serialisation guarantee and allowing concurrent read-modify-write cycles to overwrite each other's records. Fix: after unlinking the stale lock, attempt one final O_EXCL open so that exactly one concurrent waiter wins the lock and the rest receive ERR_LOCK_TIMEOUT. The unlocked fast-path is removed entirely.	2026-03-13 23:41:25 +00:00
jiarung	a7a7923d09	fix(usage-log): reject non-array token logs instead of resetting history readJsonArray treated any valid JSON that is not an array as [], causing appendRecord to overwrite the file with only the new entry — silently deleting all prior data. This is the same data-loss mode the malformed-JSON fix was trying to prevent. Fix: throw ERR_UNEXPECTED_TOKEN_LOG_SHAPE when parsed JSON is not an array so appendRecord aborts and the existing file is preserved.	2026-03-13 23:35:12 +00:00
jiarung	f267ff7888	fix(usage-log): add cross-process file lock to prevent write races The in-memory writeQueues Map serialises writes within one Node process but two concurrent OpenClaw processes sharing the same workspaceDir (e.g. parallel CLI runs) can still race: both read the same snapshot before either writes, and the later writer silently overwrites the earlier entry. Add withFileLock() — an O_EXCL advisory lock on <file>.lock — to coordinate across processes. The per-file in-memory queue is kept to reduce lock contention within the same process. On lock-acquire failure the helper retries every 50 ms up to a 5 s timeout; on timeout it removes a potentially stale lock file and makes one final attempt to prevent permanent blocking after a crash.	2026-03-13 16:04:22 +00:00
jiarung	386dbb010e	fix(git-hooks,usage-log): fix two CI failures pre-commit: guard the resolve-node.sh source with a file-existence check so the hook works in test environments that stub only the files they care about (the integration test creates run-node-tool.sh but not resolve-node.sh; node is provided via a fake binary in PATH so the nvm fallback is never needed in that context). usage-log: replace Math.random() in makeId() with crypto.randomBytes() to satisfy the temp-path-guard security lint rule that rejects weak randomness in source files.	2026-03-13 14:57:25 +00:00
jiarung	020001d9b2	fix(usage-log): propagate non-ENOENT read errors to prevent silent data loss readJsonArray previously caught all errors and returned [], so a malformed token-usage.json (e.g. from an interrupted writeFile) caused the next recordTokenUsage call to overwrite the file with only the new entry, permanently erasing all prior records. Fix: only suppress ENOENT (file not yet created). Any other error (SyntaxError, EACCES, …) is re-thrown so appendRecord aborts and the existing file is left intact. The write-queue slot still absorbs the rejection via .catch() so future writes are not stalled; callers that need to observe the failure (e.g. attempt.ts) can attach their own .catch() handler.	2026-03-13 14:25:13 +00:00
jiarung	cece47f490	fix(usage-log): remove redundant taskId field from TokenUsageRecord taskId was set to params.runId, the same value already stored in the runId field, giving downstream consumers two identical fields with different names. Remove taskId from the type and the entry constructor to avoid confusion.	2026-03-13 09:40:24 +00:00
jiarung	d03e7ae8ed	fix(usage-log): serialise concurrent writes with per-file promise queue Fire-and-forget callers (attempt.ts) can trigger two concurrent recordTokenUsage() calls for the same workspaceDir. The previous read-modify-write pattern had no locking, so the last writer silently overwrote the first, losing that run's entry. Fix: keep a Map<file, Promise<void>> write queue so each write awaits the previous one. The queue slot is replaced with a no-throw wrapper so a failed write does not stall future writes. Added a concurrent-write test (20 parallel calls) that asserts no record is lost.	2026-03-13 09:37:25 +00:00
jiarung	b5cf5aa59f	fix(usage-log): add curly braces to satisfy oxlint curly rule	2026-03-13 09:00:56 +00:00
jiarung	8cbc05ae1f	feat(usage-log): record inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens The recordTokenUsage function previously only persisted the aggregate tokensUsed total, discarding the input/output breakdown that was already available via getUsageTotals(). This meant token-usage.json had no per-record IO split, making it impossible to analyse input vs output token costs in dashboards. Changes: - Add inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens optional fields to TokenUsageRecord type in usage-log.ts (new file) - Write these fields (when non-zero) into each usage entry - Fields are omitted (not null) when unavailable, keeping existing records valid - Wire up recordTokenUsage() call in attempt.ts after llm_output hook This is a purely additive change; existing consumers that only read tokensUsed are unaffected.	2026-03-13 06:35:38 +00:00

15 Commits