appendRecord wrote token-usage.json in place with a direct fs.writeFile
call; a crash or SIGKILL during that write left truncated JSON. Because
readJsonArray now throws on any non-ENOENT error (to prevent silent data
loss) and recordTokenUsage callers swallow the error via .catch(), one
corrupted write permanently disabled all future token logging until the
file was manually repaired.
The in-place-write bug was fixed in 8c162d0ba via a temp-file + atomic
rename approach, but usage-log.ts still carried its own private
withFileLock / isLockStale implementation. That inline lock had two
known bugs that were fixed in plugin-sdk/file-lock.ts but never applied
here:
1. isLockStale treated empty / unparseable lock content as 'not stale'
   — a process that crashed between open('wx') and writeFile(pid)
   left an empty .lock that appeared live forever, blocking all
   future writers until it was manually removed.
2. No inode identity check before unlink: two waiters observing the
same stale lock could both call unlink; the slower one would
delete the faster one's freshly-acquired lock, letting both enter
fn() concurrently and race on the read-modify-write sequence.
Fix: import withFileLock from infra/file-lock.ts (which re-exports the
canonical plugin-sdk implementation) and remove the ~70-line inline lock.
APPEND_LOCK_OPTIONS reproduces the previous timeout/retry budget
(~100 × 50 ms ≈ 5 s) while gaining all fixes from plugin-sdk/file-lock.
The lock payload format changed from a plain PID string to the JSON
{pid, createdAt} envelope expected by the shared implementation; the
stale-lock integration test is updated to match.
TOCTOU in the stale-lock branch: the isStaleLock(lockPath) check and the
eventual unlink are separated by several awaits. If two
waiters (same process or different processes) both observe the same
stale file, waiter A can unlink, create a fresh lock, and start fn(),
then waiter B's delayed unlink removes A's fresh file. B then wins
open(O_EXCL) and both A and B execute fn() concurrently, breaking the
read-modify-write guarantee for token-usage.json.
Fix: snapshot the lock file's inode immediately after the EEXIST, then
re-stat right before the unlink. If the inode changed between the two
stats, a concurrent waiter already reclaimed the stale file and wrote a
fresh lock; leave the new file alone and continue to the next
open(O_EXCL) attempt. The three-outcome table:
- staleIno == -1 (file gone by the time we stat)
  → skip unlink, continue: another waiter already handled it
- staleIno == currentIno (same stale file still there)
  → safe to unlink; we and the other waiter(s) racing here all call
    rm(force:true) — the first succeeds, the rest get silent ENOENT
- staleIno != currentIno (inode changed — fresh lock in place)
  → do NOT unlink; continue and let isStaleLock reject the live lock
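A minimal sketch of that re-check, using a hypothetical helper
tryReclaimStaleLock that receives the inode snapshotted right after the
EEXIST (the helper name and return convention are assumptions, not the
actual implementation):

```typescript
import * as fs from "node:fs/promises";

// Returns true when the caller should retry open("wx") immediately
// (we removed the stale lock, or it was already gone), false when a
// concurrent waiter has already installed a fresh lock.
async function tryReclaimStaleLock(
  lockPath: string,
  staleIno: bigint, // inode snapshotted right after the EEXIST
): Promise<boolean> {
  let currentIno: bigint;
  try {
    currentIno = (await fs.stat(lockPath, { bigint: true })).ino;
  } catch {
    return true; // file already gone: another waiter handled it
  }
  if (currentIno !== staleIno) {
    return false; // inode changed: fresh lock in place — do NOT unlink it
  }
  // Same stale inode: safe to remove. Waiters racing here all use
  // force, so the losers' ENOENT is silent.
  await fs.rm(lockPath, { force: true });
  return true;
}
```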
An in-loop HELD_LOCKS re-check was considered and rejected: joining
the existing holder inside the retry loop would allow two independent
concurrent callers to run fn() simultaneously, which breaks mutual
exclusion. HELD_LOCKS reentrant join is intentionally restricted to the
entry point of acquireFileLock (recursive/reentrant callers only).
Tests added:
- two concurrent waiters on a stale lock never overlap inside fn()
(maxInside assertion, not just result set)
- existing stale-reclaim tests continue to pass
The lock file is created (empty) by open("wx") before pid/createdAt
are written by the subsequent writeFile. A process that crashes in this
narrow window leaves an empty .lock file whose content readLockPayload()
cannot parse (returns null).
Previously isStaleLock skipped both the pid-alive and the age checks
when payload was null, falling through to the mtime stat. If the mtime
was still within staleMs the function returned false, making the empty
lock appear live indefinitely — every future writer would time out and
silently drop its usage record until the file was manually deleted.
Fix: treat null payload (empty, truncated, or non-JSON content) as
stale immediately. Such a file could only have been left by a process
that never completed the write, so it is safe to reclaim without
waiting for the mtime timeout.
The mtime stat fallback is also removed: its only useful case was
exactly this null-payload scenario (it was redundant when payload is
valid, since the pid-alive and createdAt-age checks already cover the
live-lock and aged-out-lock cases).
Tests added:
- empty lock file → reclaimed, callback runs
- truncated/invalid JSON lock file → reclaimed
- pid field not a number → reclaimed
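The reworked stale check can be sketched as follows; the payload shape
matches the {pid, createdAt} envelope above, but the function signatures
and exact staleness semantics are assumptions, not the plugin-sdk code:

```typescript
type LockPayload = { pid: number; createdAt: number };

function readLockPayload(raw: string): LockPayload | null {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed?.pid !== "number" || typeof parsed?.createdAt !== "number") {
      return null; // wrong shape, e.g. pid field not a number
    }
    return parsed;
  } catch {
    return null; // empty, truncated, or non-JSON content
  }
}

function isStaleLock(raw: string, staleMs: number, now = Date.now()): boolean {
  const payload = readLockPayload(raw);
  // null payload: the writer crashed between open("wx") and writeFile,
  // so the holder can never release it — reclaim immediately.
  if (payload === null) return true;
  try {
    process.kill(payload.pid, 0); // signal 0: liveness probe only
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ESRCH") return true; // holder gone
  }
  return now - payload.createdAt > staleMs; // aged-out lock
}
```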
appendRecord previously called fs.writeFile(token-usage.json, …) directly.
A process crash or SIGKILL during that write can leave the file truncated;
readJsonArray then throws (SyntaxError), and since attempt.ts swallows the
error with .catch(), that one interrupted write silently disables all future
token logging for the workspace until the file is manually repaired.
Fix: write the new content to a uniquely-named sibling temp file first, then
call fs.rename() to atomically replace the real file. rename(2) is atomic on
POSIX when src and dst share the same directory/filesystem, so readers always
see either the old complete file or the new complete file — never a partial
write. The temp file is unlinked on error to avoid leaving orphans.
A process killed or crashed after creating token-usage.json.lock but
before the finally-unlink runs leaves a permanent stale lock. All
subsequent recordTokenUsage calls for that workspace time out and drop
their entries.
Fix:
- Write the holder's PID into the lock file on acquisition (O_EXCL + writeFile).
- On each EEXIST retry, call isLockStale() which reads the PID and sends
signal 0 (kill(pid, 0)) to check liveness without delivering a signal.
ESRCH means the process is gone → lock is stale; any other result
(alive, EPERM, unreadable file) is treated as live so we never break a
legitimately held lock.
- If stale, unlink and continue to the next O_EXCL attempt; multiple
concurrent waiters racing on the steal are safe because only one O_EXCL
open succeeds.
- Recovery is immediate (no need to wait for LOCK_TIMEOUT_MS).
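The signal-0 probe at the heart of that check can be sketched as (helper
name assumed):

```typescript
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check, nothing is delivered
    return true;
  } catch (err) {
    // ESRCH → process gone → lock is stale. Any other failure (e.g.
    // EPERM for a process we may not signal) is treated as live so a
    // legitimately held lock is never broken.
    return (err as NodeJS.ErrnoException).code !== "ESRCH";
  }
}
```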
Add a test that spawns a subprocess, waits for it to exit, writes its
dead PID into the lock file, and asserts recordTokenUsage succeeds and
cleans up the lock.
Unconditionally unlinking the lock file after LOCK_TIMEOUT_MS is unsafe:
the holder may legitimately still be running (slow disk, large usage file),
so removing its lock breaks mutual exclusion and allows concurrent
read-modify-write cycles to overwrite each other's entries.
Remove the stale-lock-removal path entirely and throw ERR_LOCK_TIMEOUT
instead. Callers already swallow the error via .catch() in the write queue,
so the only effect is that the write is skipped rather than risking data
loss through a race.
After the retry loop timed out, withFileLock unconditionally deleted the
lock file and called fn() without reacquiring the lock. If multiple
waiters timed out concurrently they would all enter the critical section
together, defeating the serialisation guarantee and allowing concurrent
read-modify-write cycles to overwrite each other's records.
Fix: after unlinking the stale lock, attempt one final O_EXCL open so
that exactly one concurrent waiter wins the lock and the rest receive
ERR_LOCK_TIMEOUT. The unlocked fast-path is removed entirely.
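The resulting loop shape, sketched with illustrative names and the
timeout/retry values mentioned earlier (this is a simplification, not the
actual withFileLock body):

```typescript
import * as fs from "node:fs/promises";

const LOCK_RETRY_MS = 50;
const LOCK_TIMEOUT_MS = 5_000;

async function tryOpenLock(lockPath: string): Promise<boolean> {
  try {
    const handle = await fs.open(lockPath, "wx"); // O_CREAT | O_EXCL
    await handle.writeFile(String(process.pid));
    await handle.close();
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err;
    return false;
  }
}

// Returns true if the lock was acquired; callers throw ERR_LOCK_TIMEOUT
// on false instead of entering the critical section unlocked.
async function acquireLock(lockPath: string): Promise<boolean> {
  const deadline = Date.now() + LOCK_TIMEOUT_MS;
  while (Date.now() < deadline) {
    if (await tryOpenLock(lockPath)) return true;
    await new Promise((r) => setTimeout(r, LOCK_RETRY_MS));
  }
  // Timed out: remove the presumed-stale lock, then make one final
  // O_EXCL attempt so exactly one concurrent waiter can win.
  await fs.rm(lockPath, { force: true });
  return tryOpenLock(lockPath);
}
```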
readJsonArray treated any valid JSON that is not an array as [], causing
appendRecord to overwrite the file with only the new entry — silently
deleting all prior data. This is the same data-loss mode the
malformed-JSON fix was trying to prevent.
Fix: throw ERR_UNEXPECTED_TOKEN_LOG_SHAPE when parsed JSON is not an
array so appendRecord aborts and the existing file is preserved.
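A minimal sketch of that shape guard (the error code comes from the notes
above; the class and function names are assumptions):

```typescript
class UnexpectedTokenLogShapeError extends Error {
  readonly code = "ERR_UNEXPECTED_TOKEN_LOG_SHAPE";
}

function ensureTokenLogArray(parsed: unknown): unknown[] {
  if (!Array.isArray(parsed)) {
    // Valid JSON of the wrong shape: returning [] here would let
    // appendRecord overwrite the file with a single entry, silently
    // erasing all prior records.
    throw new UnexpectedTokenLogShapeError("token usage log is not a JSON array");
  }
  return parsed;
}
```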
The in-memory writeQueues Map serialises writes within one Node process
but two concurrent OpenClaw processes sharing the same workspaceDir
(e.g. parallel CLI runs) can still race: both read the same snapshot
before either writes, and the later writer silently overwrites the
earlier entry.
Add withFileLock() — an O_EXCL advisory lock on <file>.lock — to
coordinate across processes. The per-file in-memory queue is kept to
reduce lock contention within the same process. On lock-acquire failure
the helper retries every 50 ms up to a 5 s timeout; on timeout it
removes a potentially stale lock file and makes one final attempt to
prevent permanent blocking after a crash.
pre-commit: guard the resolve-node.sh source with a file-existence
check so the hook works in test environments that stub only the files
they care about (the integration test creates run-node-tool.sh but not
resolve-node.sh; node is provided via a fake binary in PATH so the
nvm fallback is never needed in that context).
usage-log: replace Math.random() in makeId() with crypto.randomBytes()
to satisfy the temp-path-guard security lint rule that rejects weak
randomness in source files.
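A sketch of the replacement (the byte count is an assumption):

```typescript
import { randomBytes } from "node:crypto";

// CSPRNG-backed id instead of Math.random(): 8 random bytes → 16 hex chars.
function makeId(): string {
  return randomBytes(8).toString("hex");
}
```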
readJsonArray previously caught all errors and returned [], so a
malformed token-usage.json (e.g. from an interrupted writeFile) caused
the next recordTokenUsage call to overwrite the file with only the new
entry, permanently erasing all prior records.
Fix: only suppress ENOENT (file not yet created). Any other error
(SyntaxError, EACCES, …) is re-thrown so appendRecord aborts and the
existing file is left intact. The write-queue slot still absorbs the
rejection via .catch() so future writes are not stalled; callers that
need to observe the failure (e.g. attempt.ts) can attach their own
.catch() handler.
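A sketch of readJsonArray with the ENOENT-only suppression (illustrative,
not the actual source):

```typescript
import * as fs from "node:fs/promises";

async function readJsonArray(filePath: string): Promise<unknown[]> {
  let raw: string;
  try {
    raw = await fs.readFile(filePath, "utf8");
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") {
      return []; // file not yet created: an empty log is the right answer
    }
    throw err; // EACCES etc.: abort so the existing file stays intact
  }
  return JSON.parse(raw); // SyntaxError propagates on corrupted content
}
```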
taskId was set to params.runId, the same value already stored in the
runId field, giving downstream consumers two identical fields with
different names. Remove taskId from the type and the entry constructor
to avoid confusion.
Fire-and-forget callers (attempt.ts) can trigger two concurrent
recordTokenUsage() calls for the same workspaceDir. The previous
read-modify-write pattern had no locking, so the last writer silently
overwrote the first, losing that run's entry.
Fix: keep a Map<file, Promise<void>> write queue so each write awaits
the previous one. The queue slot is replaced with a no-throw wrapper so
a failed write does not stall future writes.
Added a concurrent-write test (20 parallel calls) that asserts no
record is lost.
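The queue pattern can be sketched as (names assumed):

```typescript
const writeQueues = new Map<string, Promise<void>>();

function enqueueWrite(file: string, write: () => Promise<void>): Promise<void> {
  const prev = writeQueues.get(file) ?? Promise.resolve();
  const next = prev.then(write);
  // Store a no-throw wrapper so one failed write never stalls the queue;
  // callers that need the failure still observe it on the returned promise.
  writeQueues.set(file, next.catch(() => {}));
  return next;
}
```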
The recordTokenUsage function previously only persisted the aggregate tokensUsed
total, discarding the input/output breakdown that was already available via
getUsageTotals(). This meant token-usage.json had no per-record IO split,
making it impossible to analyse input vs output token costs in dashboards.
Changes:
- Add inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens optional
fields to TokenUsageRecord type in usage-log.ts (new file)
- Write these fields (when non-zero) into each usage entry
- Fields are omitted (not null) when unavailable, keeping existing records valid
- Wire up recordTokenUsage() call in attempt.ts after llm_output hook
This is a purely additive change; existing consumers that only read tokensUsed
are unaffected.
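The record shape and omit-when-zero behaviour can be sketched as follows
(field names come from the notes above; the constructor is hypothetical):

```typescript
type TokenUsageRecord = {
  runId: string;
  tokensUsed: number;
  // optional IO breakdown — omitted (not null) when unavailable
  inputTokens?: number;
  outputTokens?: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
};

type IoBreakdown = Pick<
  TokenUsageRecord,
  "inputTokens" | "outputTokens" | "cacheReadTokens" | "cacheWriteTokens"
>;

function buildRecord(runId: string, tokensUsed: number, io: IoBreakdown): TokenUsageRecord {
  const record: TokenUsageRecord = { runId, tokensUsed };
  // write the breakdown fields only when non-zero, so existing records
  // and consumers that only read tokensUsed are unaffected
  for (const key of Object.keys(io) as (keyof IoBreakdown)[]) {
    if (io[key]) record[key] = io[key];
  }
  return record;
}
```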
* fix(telegram): preserve media download transport policy
* refactor(telegram): thread media transport policy
* fix(telegram): sync fallback media policy
* fix: note telegram media transport fix (#44639)
Process messageData via handleDeltaEvent for both delta and final states
before resolving the turn, so ACP clients no longer drop the last visible
assistant text when the gateway sends the final message body on the
terminal chat event.
Closes #15377
Based on #17615
Co-authored-by: PJ Eby <3527052+pjeby@users.noreply.github.com>