The test verifies that cooldownUntil IS cleared when it equals exactly
`now` (>= comparison), but the test name said "does not clear". Fixed
the name to match the actual assertion behavior.
When an auth profile hits a rate limit, `errorCount` is incremented and
`cooldownUntil` is set with exponential backoff. After the cooldown
expires, the time-based check correctly returns false — but `errorCount`
persists. The next transient failure immediately escalates to a much
longer cooldown because the backoff formula uses the stale count:
60s × 5^(errorCount-1), max 1h
This creates a positive feedback loop where profiles appear permanently
stuck after rate limits, requiring manual JSON editing to recover.
Add `clearExpiredCooldowns()` which sweeps all profiles on every call to
`resolveAuthProfileOrder()` and clears expired `cooldownUntil` /
`disabledUntil` values along with resetting `errorCount` and
`failureCounts` — giving the profile a fair retry window (circuit-breaker
half-open → closed transition).
Key design decisions:
- `cooldownUntil` and `disabledUntil` handled independently (a profile
can have both; only the expired one is cleared)
- `errorCount` reset only when ALL unusable windows have expired
- `lastFailureAt` preserved for the existing failureWindowMs decay logic
- In-memory mutation; disk persistence happens lazily on the next store
write, matching the existing save pattern
Fixes#3604
Related: #13623, #15851, #11972, #8434
* Fix subagent announce race and timeout handling
Bug 1: Subagent announce fires before model failover retries finish
- Problem: CLI provider emitted lifecycle error on each attempt, causing
subagent registry to prematurely call beginSubagentCleanup() and announce
with incorrect status before failover retries completed
- Fix: Removed lifecycle error emission from CLI provider's attempt-level
.catch() in agent-runner-execution.ts. Errors still propagate to
runWithModelFallback for retry, but no intermediate lifecycle events
are emitted. Only the final outcome (after all retries) emits lifecycle
events.
Bug 2: Hard 600s per-prompt timeout ignores runTimeoutSeconds=0
- Problem: When runTimeoutSeconds=0 (meaning 'no timeout'), the code
returned the default 600s timeout instead of respecting the 0 setting
- Fix: Modified resolveAgentTimeoutMs() to treat 0 as 'no timeout' and
return a very large timeout value (30 days) instead of the default.
This avoids setTimeout issues with Infinity while effectively providing
unlimited time for long-running tasks.
* fix: emit lifecycle:error for CLI failures (#6621) (thanks @tyler6204)
* chore: satisfy format/lint gates (#6621) (thanks @tyler6204)
* fix: restore build after upstream type changes (#6621) (thanks @tyler6204)
* test: fix createSystemPromptOverride tests to match new return type (#6621) (thanks @tyler6204)
When a secondary agent's OAuth token expires and refresh fails, the agent
would error out even if the main agent had fresh, valid credentials for
the same profile.
This fix adds a fallback mechanism that:
1. Detects when OAuth refresh fails for a secondary agent (agentDir is set)
2. Checks if the main agent has fresh credentials for the same profileId
3. If so, copies those credentials to the secondary agent and uses them
4. Logs the inheritance for debugging
This prevents the situation where users have to manually copy auth-profiles.json
between agent directories when tokens expire at different times.
Fixes: Secondary agents failing with 'OAuth token refresh failed' while main
agent continues to work fine.