fix(file-lock): guarantee post-reclaim open attempt on last retry slot

When acquireFileLock detects a stale lock on the final loop iteration
(attempt === attempts - 1), the subsequent `continue` increments attempt
to `attempts`, causing the loop to exit and throw 'file lock timeout'
without ever calling open(O_EXCL) on the now-clear path.

This is the crash-recovery bug reported in the P2 badge: an empty/partial
.lock file left by a crash between open("wx") and writeFile becomes
reclaimable on the last slot, but the call still throws timeout and the
usage record is silently dropped.

Fix: when on the last slot, step attempt back by one before the continue
so the upcoming += 1 nets to zero, guaranteeing at least one post-reclaim
open(O_EXCL) attempt. No extra sleep budget is consumed; a retry is only
charged for backoff-sleep iterations, not for reclaim-only work.

Adds a regression test: retries=0 (attempts=1), stale empty lock file
aged past staleMs — withFileLock must succeed, not throw timeout.
jiarung 2026-03-17 06:57:09 +00:00
parent aab1ed7b1b
commit 14303dac74
2 changed files with 40 additions and 0 deletions


@@ -187,6 +187,30 @@ describe("withFileLock", () => {
},
);
it("reclaims a stale empty lock file even when detected on the very last retry (retries=0)", async () => {
// Regression: when retries=0 (attempts=1), a stale lock detected on
// attempt 0 caused `continue` to increment attempt to 1, exiting the loop
// and throwing timeout without ever calling open(O_EXCL) on the now-clear
// path. This reproduces the token-usage.json.lock crash-recovery bug where
// an empty .lock file left by a crash is reclaimable only on the last slot.
const lockPath = `${targetFile}.lock`;
await fs.mkdir(path.dirname(targetFile), { recursive: true });
await fs.writeFile(lockPath, ""); // empty — crash between open("wx") and writeFile
await ageFile(lockPath, LOCK_OPTIONS.stale + 1_000); // age past stale threshold
let ran = false;
await expect(
withFileLock(
targetFile,
{ ...LOCK_OPTIONS, retries: { ...LOCK_OPTIONS.retries, retries: 0 } },
async () => {
ran = true;
},
),
).resolves.toBeUndefined();
expect(ran).toBe(true);
});
it("two concurrent waiters on a stale lock never overlap inside fn()", async () => {
// Plant a stale lock (dead PID, old timestamp) so both waiters will
// simultaneously enter the stale-reclaim branch. The inode guard must


@@ -215,6 +215,22 @@ export async function acquireFileLock(
// Retry open(O_EXCL) regardless: either we removed the stale lock or
// a concurrent waiter already handled it; either way, the path is now
// either free or holds a fresh lock that isStaleLock will reject.
//
// Guard: the for-loop's `attempt += 1` runs after every `continue`,
// consuming a retry slot. If we are already on the last slot
// (attempt === attempts - 1), that increment exits the loop and we
// throw timeout without ever re-trying open(O_EXCL) on the now-clear
// path. This is the crash-recovery bug: an empty/partial .lock file
// (crash between open("wx") and writeFile) that becomes reclaimable
// on the last iteration causes the record to be silently dropped.
//
// Fix: when on the last slot, step attempt back by one so the
// upcoming += 1 nets to zero, guaranteeing at least one post-reclaim
// open(O_EXCL) attempt. No extra sleep budget is consumed — we only
// charge a retry for backoff sleeps, not reclaim-only work.
if (attempt >= attempts - 1) {
attempt -= 1;
}
continue;
}