runEmbeddedAttempt calls process.chdir(effectiveWorkspace) early in the
run, then later invokes recordTokenUsage with the raw params.workspaceDir
string. If workspaceDir is a relative path (e.g. ./ws), recordTokenUsage
resolves it from the already-changed cwd, producing a nested path
(`./ws/ws/memory/token-usage.json`) or an outright failure.
Fix: pass effectiveWorkspace (the fully-resolved, sandbox-aware absolute
path that was used for every other workspace operation in the run) into
recordTokenUsage so usage logs always land in the correct directory.
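A minimal sketch of the failure mode described above, assuming a POSIX
layout. usageLogPath is a hypothetical helper, not the project's code, and
the cwd is passed in explicitly instead of calling process.chdir so the
example has no side effects.

```typescript
import * as path from "node:path";

// Hypothetical helper: where the usage log lands when `workspaceDir` is
// resolved against a given current working directory.
function usageLogPath(workspaceDir: string, cwd: string): string {
  return path.posix.resolve(cwd, workspaceDir, "memory", "token-usage.json");
}

const startCwd = "/home/user";
// Resolved once, before any chdir: this plays the role of effectiveWorkspace.
const effectiveWorkspace = path.posix.resolve(startCwd, "./ws");

// After chdir(effectiveWorkspace), the raw relative string is resolved again
// from the new cwd and nests one level too deep.
const buggy = usageLogPath("./ws", effectiveWorkspace);
// An absolute path ignores the cwd entirely.
const fixed = usageLogPath(effectiveWorkspace, effectiveWorkspace);
```

Here buggy ends in ws/ws/memory/token-usage.json while fixed ends in
ws/memory/token-usage.json.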
843e3c1ef restored a recency grace window (60 s) for append messages:
messages newer than connectedAtMs - 60 s are still forwarded to
onMessage so genuinely recent offline arrivals trigger auto-reply.
The test 'handles append messages by marking them read but skipping
auto-reply' used nowSeconds() as the message timestamp, which falls
inside the grace window and therefore reaches onMessage — contradicting
the expect(onMessage).not.toHaveBeenCalled() assertion.
Fix: use nowSeconds(-120_000) (2 minutes before now) so the message is
clearly outside the grace window and the append-recency filter correctly
skips it.
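The timestamp arithmetic above can be sketched as follows. Both functions
are assumptions made for illustration: nowSeconds mirrors the test helper
named above (offset in milliseconds, result in seconds), and
isWithinGraceWindow is a hypothetical stand-in for the append-recency
filter.

```typescript
const GRACE_MS = 60_000; // recency grace window from the text

// Assumed shape of the test helper: current time in whole seconds, shifted
// by an offset given in milliseconds. `nowMs` is injectable for determinism.
function nowSeconds(offsetMs = 0, nowMs: number = Date.now()): number {
  return Math.floor((nowMs + offsetMs) / 1000);
}

// Hypothetical filter: an append message is forwarded to onMessage only if
// its timestamp is newer than connectedAtMs - GRACE_MS.
function isWithinGraceWindow(messageTsSeconds: number, connectedAtMs: number): boolean {
  return messageTsSeconds * 1000 > connectedAtMs - GRACE_MS;
}
```

With connectedAtMs equal to "now", a message stamped nowSeconds() is inside
the window, while one stamped nowSeconds(-120_000) is outside it.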
When auth is completely disabled (mode=none), requiring device pairing
for Control UI operator sessions adds friction without security value
since any client can already connect without credentials.
Add authMode parameter to shouldSkipControlUiPairing so the bypass
fires only for Control UI + operator role + auth.mode=none. This avoids
the #43478 regression where a top-level OR disabled pairing for ALL
websocket clients.
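A sketch of the narrowed predicate, under the assumption that the inputs
can be modelled as a small context object. The real signature of
shouldSkipControlUiPairing is not shown in this message, so the parameter
shape below is hypothetical.

```typescript
// Hypothetical input shape; field names are illustrative only.
interface PairingContext {
  isControlUi: boolean;
  role: string;     // e.g. "operator"
  authMode: string; // e.g. "none"
}

function shouldSkipControlUiPairing(ctx: PairingContext): boolean {
  // All three conditions are ANDed. A top-level OR on authMode alone would
  // skip pairing for every websocket client, which is the regression the
  // text warns about.
  return ctx.isControlUi && ctx.role === "operator" && ctx.authMode === "none";
}
```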
* fix(web): handle 515 Stream Error during WhatsApp QR pairing
getStatusCode() never unwrapped the lastDisconnect wrapper object,
so login.errorStatus was always undefined and the 515 restart path
in restartLoginSocket was dead code.
- Add err.error?.output?.statusCode fallback to getStatusCode()
- Export waitForCredsSaveQueue() so callers can await pending creds
- Await creds flush in restartLoginSocket before creating new socket
Fixes #3942
* test: update session mock for getStatusCode unwrap + waitForCredsSaveQueue
Mirror the getStatusCode fix (err.error?.output?.statusCode fallback)
in the test mock and export waitForCredsSaveQueue so restartLoginSocket
tests work correctly.
* fix(web): scope creds save queue per-authDir to avoid cross-account blocking
The credential save queue was a single global promise chain shared by all
WhatsApp accounts. In multi-account setups, a slow save on one account
blocked credential writes and 515 restart recovery for unrelated accounts.
Replace the global queue with a per-authDir Map so each account's creds
serialize independently. waitForCredsSaveQueue() now accepts an optional
authDir to wait on a single account's queue, or waits on all when omitted.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
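One way to picture the per-authDir queue is a Map of promise chains. This
is a sketch under stated assumptions, not the project's implementation:
enqueueCredsSave is a hypothetical name, failed saves are swallowed to keep
the chain alive, and settled entries are evicted as a later commit in this
series describes.

```typescript
// One promise chain per authDir.
const credsSaveQueues = new Map<string, Promise<void>>();

function enqueueCredsSave(authDir: string, save: () => Promise<void>): Promise<void> {
  const prev = credsSaveQueues.get(authDir) ?? Promise.resolve();
  // Run after the previous save for this authDir only. Errors are swallowed
  // here (assumption) so one failed save does not poison the chain.
  const next = prev.then(() => save()).catch(() => {});
  credsSaveQueues.set(authDir, next);
  // Evict the entry once settled, unless a newer save replaced it meanwhile.
  void next.then(() => {
    if (credsSaveQueues.get(authDir) === next) credsSaveQueues.delete(authDir);
  });
  return next;
}

// Waits on one account's queue, or on all queues when authDir is omitted.
function waitForCredsSaveQueue(authDir?: string): Promise<void> {
  if (authDir !== undefined) return credsSaveQueues.get(authDir) ?? Promise.resolve();
  return Promise.all(Array.from(credsSaveQueues.values())).then(() => undefined);
}
```

A slow save queued for one authDir does not delay a save queued for
another, because each key has its own chain.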
* test: use real Baileys v7 error shape in 515 restart test
The test was using { output: { statusCode: 515 } } which was already
handled before the fix. Updated to use the actual Baileys v7 shape
{ error: { output: { statusCode: 515 } } } to cover the new fallback
path in getStatusCode.
Co-Authored-By: Claude Code (Opus 4.6) <noreply@anthropic.com>
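The two error shapes named above can be handled with a single fallback
chain. This is a minimal sketch assuming only those two shapes matter; the
interface names are made up for the example and the real getStatusCode may
check additional fields.

```typescript
// Shapes taken from the text: the status code sits either directly on
// err.output, or one level down on err.error.output (the wrapper object).
interface BoomLike { output?: { statusCode?: number } }
interface DisconnectLike extends BoomLike { error?: BoomLike }

function getStatusCode(err: unknown): number | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const e = err as DisconnectLike;
  // Try the direct shape first, then unwrap the wrapper.
  return e.output?.statusCode ?? e.error?.output?.statusCode;
}
```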
* fix(web): bound credential-queue wait during 515 restart
Prevents restartLoginSocket from blocking indefinitely if a queued
saveCreds() promise stalls (e.g. hung filesystem write).
Co-Authored-By: Claude <noreply@anthropic.com>
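A bounded wait of this kind is commonly built from Promise.race and a
timer. The sketch below is illustrative only: waitBounded is a hypothetical
name, and the timeout value is left to the caller because the budget used
by restartLoginSocket is not stated here.

```typescript
// Waits for `pending`, but gives up after `timeoutMs`. The timer handle is
// cleared in `finally`, so a wait that settles early leaves no timer behind.
async function waitBounded(
  pending: Promise<unknown>,
  timeoutMs: number,
): Promise<"settled" | "timed_out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed_out">((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), timeoutMs);
  });
  try {
    return await Promise.race([
      // Fulfilment and rejection both count as "settled" for this purpose.
      pending.then(() => "settled" as const, () => "settled" as const),
      timeout,
    ]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A caller can proceed with the socket restart on either outcome and log the
"timed_out" case.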
* fix: clear flush timeout handle and assert creds queue in test
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: evict settled credsSaveQueues entries to prevent unbounded growth
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: share WhatsApp 515 creds flush handling (#27910) (thanks @asyncjason)
---------
Co-authored-by: Jason Separovic <jason@wilma.dog>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ayaan Zaidi <hi@obviy.us>
writeQueues was keyed by the raw workspaceDir-derived path before any
realpath resolution. Two callers using different spellings of the same
physical directory (a symlink and its target, or a relative vs absolute
path) therefore produced separate queue entries and both entered
appendRecord concurrently.
Inside appendRecord, withFileLock calls resolveNormalizedFilePath which
uses fs.realpath on the directory; both spellings resolve to the same
normalised path. If one chain is already in fn() — its entry set in
HELD_LOCKS — the second chain's acquireFileLock sees HELD_LOCKS hit for
the same normalised path and re-entrantly joins it. Both callbacks then
execute the read-modify-write cycle concurrently, and whichever writes
last overwrites the first, silently dropping one entry per collision.
Fix: call fs.realpath(memoryDir) immediately after fs.mkdir and use the
canonical path as both the writeQueues key and the appendRecord file
argument. A single canonical key means all in-process writers for the
same physical file are serialised through one queue regardless of how
the workspace path was spelled by the caller.
Test: symlink tmpDir to a second name and interleave concurrent
recordTokenUsage calls across both spellings. Asserts all N records
survive — regression guard for the path-alias queue split.
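The effect of canonicalising the key can be sketched with synchronous fs
calls. queueKeyFor is a hypothetical helper and the file name is taken from
the text; the real code uses the async fs API and a writeQueues map that is
not reproduced here.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: the key used to look up a write queue. realpath
// collapses symlinks and relative spellings to one canonical path.
function queueKeyFor(memoryDir: string): string {
  fs.mkdirSync(memoryDir, { recursive: true });
  return path.join(fs.realpathSync(memoryDir), "token-usage.json");
}

// Two spellings of one physical directory: the real path and a symlink.
const realDir = fs.mkdtempSync(path.join(os.tmpdir(), "usage-real-"));
const aliasDir = `${realDir}-alias`;
fs.symlinkSync(realDir, aliasDir, "dir");

const keyViaReal = queueKeyFor(realDir);
const keyViaAlias = queueKeyFor(aliasDir);

// Clean up the temporary directory and its alias.
fs.rmSync(aliasDir, { force: true });
fs.rmSync(realDir, { recursive: true, force: true });
```

Both keys are identical, so writers arriving through either spelling would
share one queue.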
Three independent fixes bundled here because they came from the same
review pass.
── 1. Record lock owner identity beyond PID (file-lock) ──────────────
Stale-lock detection used only isPidAlive(), but PIDs are reusable.
On systems with small PID namespaces (containers, rapid restarts) a
crashed writer's PID can be reassigned to an unrelated live process,
causing isStaleLock to return false and the lock to appear held
indefinitely.
Fix: record the process start time (field 22 from /proc/{pid}/stat)
alongside pid and createdAt. On Linux, if the current holder's
startTime differs from the stored value, the PID was recycled and the
lock is reclaimed immediately. On other platforms startTime is omitted
and the existing createdAt age check remains the fallback: the lock
keeps its original createdAt even when the PID is reused, so it is
reclaimed once its age exceeds staleMs.
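Reading field 22 needs some care because field 2 (the command name) can
contain spaces and parentheses. The parser below is a sketch with
hypothetical function names; it takes the file content as a string so it
can be exercised without a Linux /proc.

```typescript
// Extracts field 22 (starttime, in clock ticks since boot) from the content
// of /proc/<pid>/stat. Field 2 (comm) is parenthesised and may itself contain
// spaces or ')', so parsing starts after the LAST ')'.
function parseProcStartTime(stat: string): number | null {
  const close = stat.lastIndexOf(")");
  if (close < 0) return null;
  const rest = stat.slice(close + 1).trim().split(/\s+/); // rest[0] is field 3
  const raw = rest[22 - 3];
  if (raw === undefined) return null;
  const value = Number(raw);
  return Number.isInteger(value) && value >= 0 ? value : null;
}

// The PID was recycled when the live process's start time is known and
// differs from the one recorded in the lock payload.
function isPidRecycled(storedStartTime: number, currentStartTime: number | null): boolean {
  return currentStartTime !== null && currentStartTime !== storedStartTime;
}
```

When the start time cannot be read (non-Linux, or the process is gone),
isPidRecycled returns false and the decision falls to the other checks.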
── 2. Restore mtime fallback for null/unparseable payloads (file-lock) ─
The previous fix treated null payload as immediately stale. But the
lock file is created (empty) by open('wx') before writeFile fills in
the JSON. A live writer still in that window has an empty file; marking
it stale immediately allows a second process to steal the lock and both
to enter fn() concurrently.
Fix: when payload is null, fall back to the file's mtime. A file
younger than staleMs may belong to a live writer and is left alone; a
file older than staleMs was definitely orphaned and is reclaimed. A
new test asserts that a freshly-created empty lock (recent mtime) is NOT
treated as stale.
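The resulting decision can be written as a pure function. This is a sketch
under assumptions: the inputs are gathered by the caller, createdAt is
modelled as epoch milliseconds (createdAtMs), and the start-time check from
fix 1 is left out for brevity.

```typescript
interface LockOwner { pid: number; createdAtMs: number } // assumed payload view

interface StaleInputs {
  payload: LockOwner | null; // null when the lock file is empty or unparseable
  mtimeMs: number;           // lock file's modification time
  nowMs: number;
  staleMs: number;
  isPidAlive: (pid: number) => boolean;
}

function isStaleLock(i: StaleInputs): boolean {
  if (i.payload === null) {
    // No readable owner: only the file's age separates a live writer that is
    // between open('wx') and writeFile from an orphaned file.
    return i.nowMs - i.mtimeMs > i.staleMs;
  }
  if (!i.isPidAlive(i.payload.pid)) return true;
  return i.nowMs - i.payload.createdAtMs > i.staleMs;
}
```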
── 3. Strip prerelease suffix before printf '%05d' (resolve-node.sh) ──
When an nvm install has a prerelease directory name (e.g.
v22.0.0-rc.1/bin/node), splitting on '.' leaves _pa as '0-rc.1'.
printf '%05d' then fails because '0-rc.1' is not an integer, and
set -euo pipefail aborts the hook before lint/format can run — the
opposite of what the nvm fallback is meant to achieve.
Fix: strip everything from the first non-digit character onward in
each component before printf: '0-rc.1' → '0', '14' → '14' (no-op for
normal releases).
Uses POSIX parameter expansion so it works on both
GNU bash and macOS bash 3.x.
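The hook itself is shell; the TypeScript below only mirrors the stripping
logic so it can be checked in isolation. Both function names are
hypothetical.

```typescript
// Keep the leading digits of a version component and drop everything from
// the first non-digit onward: "0-rc.1" -> "0", "14" -> "14".
function leadingDigits(component: string): string {
  const match = /^[0-9]*/.exec(component);
  return match ? match[0] : "";
}

// Splits an nvm directory name such as "v22.0.0-rc.1" into numeric parts.
function parseNodeDirVersion(dirName: string): [number, number, number] {
  const [major = "", minor = "", ...rest] = dirName.replace(/^v/, "").split(".");
  const patch = rest.join("."); // "0-rc.1" for "v22.0.0-rc.1"
  const toInt = (s: string): number => Number(leadingDigits(s) || "0");
  return [toInt(major), toInt(minor), toInt(patch)];
}
```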
appendRecord wrote token-usage.json in place with a direct fs.writeFile
call; a crash or SIGKILL during that write left truncated JSON. Because
readJsonArray now throws on any non-ENOENT error (to prevent silent data
loss) and recordTokenUsage callers swallow the error via .catch(), one
corrupted write permanently disabled all future token logging until the
file was manually repaired.
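The usual way to avoid truncated files from an in-place write is to write a
sibling temp file and rename it over the target. The sketch below shows
that general technique with synchronous calls; writeJsonAtomic and the temp
file naming are assumptions, not the code from the referenced commit.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Write JSON to a sibling temp file, then rename it over the target. A
// rename within one directory is atomic on POSIX filesystems, so readers see
// either the old content or the new content, never a partial file.
function writeJsonAtomic(filePath: string, value: unknown): void {
  const tmpPath = `${filePath}.${process.pid}.${Date.now()}.tmp`;
  try {
    fs.writeFileSync(tmpPath, JSON.stringify(value, null, 2), "utf-8");
    fs.renameSync(tmpPath, filePath);
  } catch (err) {
    fs.rmSync(tmpPath, { force: true }); // best-effort cleanup
    throw err;
  }
}

// Usage: two successive writes to a throwaway directory.
const usageDir = fs.mkdtempSync(path.join(os.tmpdir(), "usage-log-"));
const usageFile = path.join(usageDir, "token-usage.json");
writeJsonAtomic(usageFile, [{ tokens: 1 }]);
writeJsonAtomic(usageFile, [{ tokens: 1 }, { tokens: 2 }]);
const roundTrip = JSON.parse(fs.readFileSync(usageFile, "utf-8"));
```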
The in-place-write bug was fixed in 8c162d0ba via a temp-file + atomic
rename approach, but usage-log.ts still carried its own private
withFileLock / isLockStale implementation. That inline lock had two
known bugs that were fixed in plugin-sdk/file-lock.ts but never applied
here:
1. isLockStale treated empty / unparseable lock content as 'not stale'
— a process that crashes between open('wx') and writeFile(pid)
leaves an empty .lock that appeared live forever, blocking all
future writers until it was manually removed.
2. No inode identity check before unlink: two waiters observing the
same stale lock could both call unlink; the slower one would
delete the faster one's freshly-acquired lock, letting both enter
fn() concurrently and race on the read-modify-write sequence.
Fix: import withFileLock from infra/file-lock.ts (which re-exports the
canonical plugin-sdk implementation) and remove the ~70-line inline lock.
APPEND_LOCK_OPTIONS reproduces the previous timeout/retry budget
(~100 × 50 ms ≈ 5 s) while gaining all fixes from plugin-sdk/file-lock.
The lock payload format changed from a plain PID string to the JSON
{pid, createdAt} envelope expected by the shared implementation; the
stale-lock integration test is updated to match.
Update 5 references to the old "Clawdbot" name in
skills/apple-reminders/SKILL.md and skills/imsg/SKILL.md.
Co-authored-by: imanisynapse <imanisynapse@gmail.com>
* feat: make compaction timeout configurable via agents.defaults.compaction.timeoutSeconds
The hardcoded 5-minute (300s) compaction timeout causes large sessions
to enter a death spiral where compaction repeatedly fails and the
session grows indefinitely. This adds agents.defaults.compaction.timeoutSeconds
to allow operators to override the compaction safety timeout.
Default raised to 900s (15min) which is sufficient for sessions up to
~400k tokens. The resolved timeout is also used for the session write
lock duration so locks don't expire before compaction completes.
Fixes #38233
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add resolveCompactionTimeoutMs tests
Cover config resolution edge cases: undefined config, missing
compaction section, valid seconds, fractional values, zero,
negative, NaN, and Infinity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add timeoutSeconds to compaction Zod schema
The compaction object schema uses .strict(), so setting the new
timeoutSeconds config option would fail validation at startup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: enforce integer constraint on compaction timeoutSeconds schema
Prevents sub-second values like 0.5 which would floor to 0ms and
cause immediate compaction timeout. Matches pattern of other
integer timeout fields in the schema.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: clamp compaction timeout to Node timer-safe maximum
Values above 2^31 - 1 ms (~2.1B) overflow Node's setTimeout, which
then fires after 1 ms, causing an immediate timeout. Clamp to
MAX_SAFE_TIMEOUT_MS, matching the pattern in agents/timeout.ts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
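Taken together, the resolution and clamping steps might look like the
sketch below. Several details are assumptions rather than facts from this
series: the config type, the fallback to the default for invalid values,
flooring to whole seconds, and the exact value of MAX_SAFE_TIMEOUT_MS
(taken here as 2^31 - 1).

```typescript
const DEFAULT_COMPACTION_TIMEOUT_MS = 900_000; // 900 s default from the text
const MAX_SAFE_TIMEOUT_MS = 2_147_483_647;     // assumed: 2^31 - 1

// Hypothetical slice of the config shape.
interface ConfigLike {
  agents?: { defaults?: { compaction?: { timeoutSeconds?: number } } };
}

function resolveCompactionTimeoutMs(cfg?: ConfigLike): number {
  const seconds = cfg?.agents?.defaults?.compaction?.timeoutSeconds;
  // Assumption: missing or non-finite values fall back to the default.
  if (typeof seconds !== "number" || !Number.isFinite(seconds)) {
    return DEFAULT_COMPACTION_TIMEOUT_MS;
  }
  // Assumption: sub-second and non-positive values also use the default
  // instead of producing a 0 ms timeout.
  const wholeSeconds = Math.floor(seconds);
  if (wholeSeconds <= 0) return DEFAULT_COMPACTION_TIMEOUT_MS;
  // Clamp so the value stays within what Node timers accept.
  return Math.min(wholeSeconds * 1000, MAX_SAFE_TIMEOUT_MS);
}
```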
* fix: add FIELD_LABELS entry for compaction timeoutSeconds
Maintains label/help parity invariant enforced by
schema.help.quality.test.ts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: align compaction timeouts with abort handling
* fix: land compaction timeout handling (#46889) (thanks @asyncjason)
---------
Co-authored-by: Jason Separovic <jason@wilma.dog>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ayaan Zaidi <hi@obviy.us>
* fix: fetch OpenRouter model capabilities at runtime for unknown models
When an OpenRouter model is not in the built-in static snapshot from
pi-ai, the fallback hardcodes input: ["text"], silently dropping images.
Query the OpenRouter API at runtime to detect actual capabilities
(image support, reasoning, context window) for models not in the
built-in list. Results are cached in memory for 1 hour. On API
failure/timeout, falls back to text-only (no regression).
* feat(openrouter): add disk cache for OpenRouter model capabilities
Persist the OpenRouter model catalog to ~/.openclaw/cache/openrouter-models.json
so it survives process restarts. Cache lookup order:
1. In-memory Map (instant)
2. On-disk JSON file (avoids network on restart)
3. OpenRouter API fetch (populates both layers)
Also triggers a background refresh when a model is not found in the cache,
in case it was newly added to OpenRouter.
* refactor(openrouter): remove pre-warm, use pure lazy-load with disk cache
- Remove eager ensureOpenRouterModelCache() from run.ts
- Remove TTL — model capabilities are stable, no periodic re-fetching
- Cache lookup: in-memory → disk → API fetch (only when needed)
- API is only called when no cache exists or a model is not found
- Disk cache persists across gateway restarts
* fix(openrouter): address review feedback
- Fix timer leak: move clearTimeout to finally block
- Fix modality check: only check input side of "->" separator to avoid
matching image-generation models (text->image)
- Use resolveStateDir() instead of hardcoded homedir()/.openclaw
- Separate cache dir and filename constants
- Add utf-8 encoding to writeFileSync for consistency
- Add data validation when reading disk cache
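The modality fix from the list above can be illustrated with a small
parser. It assumes the modality string has the form "inputs->outputs" with
"+" between entries (for example "text+image->text"); acceptsImageInput is
a hypothetical name.

```typescript
// Only the input side of "->" decides whether images can be sent to the
// model. "text->image" is an image-generation model, not an image-input one.
function acceptsImageInput(modality: string): boolean {
  const inputSide = modality.split("->")[0] ?? "";
  return inputSide
    .split("+")
    .map((m) => m.trim().toLowerCase())
    .some((m) => m === "image");
}
```

A naive substring check on the whole string would wrongly report image
input for "text->image".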
* ci: retrigger checks
* fix: preload unknown OpenRouter model capabilities before resolve
* fix: accept top-level OpenRouter max token metadata
* fix: update changelog for OpenRouter runtime capability lookup (#45824) (thanks @DJjjjhao)
* fix: avoid redundant OpenRouter refetches and preserve suppression guards
---------
Co-authored-by: Ayaan Zaidi <hi@obviy.us>
TOCTOU in the stale-lock branch: several awaits separate the
isStaleLock(lockPath) check returning true from the unlink call. If two
waiters (same process or different processes) both observe the same
stale file, waiter A can unlink, create a fresh lock, and start fn(),
then waiter B's delayed unlink removes A's fresh file. B then wins
open(O_EXCL) and both A and B execute fn() concurrently, breaking the
read-modify-write guarantee for token-usage.json.
Fix: snapshot the lock file's inode immediately after the EEXIST, then
re-stat right before the unlink. If the inode changed between the two
stats, a concurrent waiter already reclaimed the stale file and wrote a
fresh lock; leave the new file alone and continue to the next
open(O_EXCL) attempt. The three-outcome table:
staleIno == -1 (file gone by the time we stat)
→ skip unlink, continue: another waiter already handled it
staleIno == currentIno (same stale file still there)
→ safe to unlink; we and the other waiter(s) racing here all call
rm(force:true) — the first succeeds, the rest get silent ENOENT
staleIno != currentIno (inode changed — fresh lock in place)
→ do NOT unlink; continue and let isStaleLock reject the live lock
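The three-outcome table above maps directly onto a small decision function.
The names below are hypothetical; -1 stands for "file not found", as in the
table.

```typescript
type ReclaimAction = "skip" | "unlink" | "leave";

// staleIno: inode seen right after EEXIST.
// currentIno: inode from the re-stat just before unlink.
function decideReclaim(staleIno: number, currentIno: number): ReclaimAction {
  if (staleIno === -1) return "skip";           // file already gone
  if (staleIno === currentIno) return "unlink"; // same stale file still there
  return "leave";                               // inode changed: fresh lock
}
```

In every case the caller then continues to the next open(O_EXCL) attempt;
only the "unlink" outcome removes the file first.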
A note on the in-loop HELD_LOCKS re-check that was considered: joining
the existing holder inside the retry loop would allow two independent
concurrent callers to run fn() simultaneously, which breaks mutual
exclusion. HELD_LOCKS reentrant join is intentionally restricted to the
entry point of acquireFileLock (recursive/reentrant callers only).
Tests added:
- two concurrent waiters on a stale lock never overlap inside fn()
(maxInside assertion, not just result set)
- existing stale-reclaim tests continue to pass
The lock file is created (empty) by open("wx") before pid/createdAt
are written by the subsequent writeFile. A process that crashes in this
narrow window leaves an empty .lock file whose content readLockPayload()
cannot parse (returns null).
Previously isStaleLock skipped both the pid-alive and the age checks
when payload was null, falling through to the mtime stat. If the mtime
was still within staleMs the function returned false, making the empty
lock appear live indefinitely — every future writer would time out and
silently drop its usage record until the file was manually deleted.
Fix: treat null payload (empty, truncated, or non-JSON content) as
stale immediately. Such a file could only have been left by a process
that never completed the write, so it is safe to reclaim without
waiting for the mtime timeout.
The mtime stat fallback is also removed: its only useful case was
exactly this null-payload scenario (it was redundant when payload is
valid, since the pid-alive and createdAt-age checks already cover the
live-lock and aged-out-lock cases).
Tests added:
- empty lock file → reclaimed, callback runs
- truncated/invalid JSON lock file → reclaimed
- pid field not a number → reclaimed
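The three test cases correspond to the ways payload parsing can fail. The
sketch below takes the raw file content as a string, which is an assumption
made so it runs without a lock file; the type of createdAt is left open
because it is not specified here.

```typescript
interface LockPayload { pid: number; createdAt: unknown }

// Returns null for empty, truncated, or non-JSON content, and for JSON whose
// pid is not a finite number.
function readLockPayload(raw: string): LockPayload | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const candidate = parsed as { pid?: unknown; createdAt?: unknown };
  if (typeof candidate.pid !== "number" || !Number.isFinite(candidate.pid)) return null;
  return { pid: candidate.pid, createdAt: candidate.createdAt };
}
```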
sort -V is a GNU extension; BSD sort on macOS does not support it. When
node is absent from PATH and the nvm fallback runs, set -euo pipefail
causes the unsupported flag to abort the hook before lint/format can
run, blocking commits on macOS.
Replace the sort -V | tail -1 pipeline with a Bash for-loop that
zero-pads each semver component to five digits and emits a tab-delimited
key+path line. Plain sort + tail -1 + cut then selects the highest
semantic version — no GNU-only flags required.
Smoke-tested with v18 vs v22 paths; v22 is correctly selected on both
GNU and BSD sort.
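The hook is shell, but the zero-padded key idea is easy to check in
TypeScript. The sketch below mirrors it with hypothetical names and example
paths; a plain lexicographic sort over the keyed lines picks the highest
version.

```typescript
function pad5(n: number): string {
  return ("00000" + n).slice(-5);
}

// Builds a fixed-width key such as "000220000100000" for v22.1.0.
function versionKey(nodePath: string): string {
  const m = /v(\d+)\.(\d+)\.(\d+)/.exec(nodePath);
  if (!m) return pad5(0) + pad5(0) + pad5(0);
  return pad5(Number(m[1])) + pad5(Number(m[2])) + pad5(Number(m[3]));
}

function pickHighestVersionPath(paths: string[]): string | undefined {
  // key<TAB>path lines, as in the hook; sort() is plain string order.
  const keyed = paths.map((p) => versionKey(p) + "\t" + p);
  keyed.sort();
  const last = keyed[keyed.length - 1];
  return last === undefined ? undefined : last.slice(last.indexOf("\t") + 1);
}
```

Without padding, a string sort would rank v9 above v22; with five-digit
components the string order matches the numeric order.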