16 Commits

Author SHA1 Message Date
Ayane
76ed274aad fix(agents): trigger model failover on connection-refused and network-unreachable errors
Previously, only ETIMEDOUT / ESOCKETTIMEDOUT / ECONNRESET / ECONNABORTED
were recognised as failover-worthy network errors. Connection-level
failures such as ECONNREFUSED (server down), ENETUNREACH / EHOSTUNREACH
(network disconnected), ENETRESET, and EAI_AGAIN (DNS failure) were
treated as unknown errors and did not advance the fallback chain.

This is particularly impactful when a local fallback model (e.g. Ollama)
is configured: if the remote provider is unreachable due to a network
outage, the gateway should fall back to the local model instead of
returning an error to the user.

Add the missing error codes to resolveFailoverReasonFromError() and
corresponding e2e tests.

Closes #18868
2026-03-02 02:08:27 +00:00
Frank Yang
ed86252aa5
fix: handle CLI session expired errors gracefully instead of crashing gateway (#31090)
* fix: handle CLI session expired errors gracefully

- Add session_expired to FailoverReason type
- Add isCliSessionExpiredErrorMessage to detect expired CLI sessions
- Modify runCliAgent to retry with new session when session expires
- Update agentCommand to clear expired session IDs from session store
- Add proper error handling to prevent gateway crashes on expired sessions

Fixes #30986

* fix: add session_expired to AuthProfileFailureReason and missing log import

* fix: type cli-runner usage field to match EmbeddedPiAgentMeta

* fix: harden CLI session-expiry recovery handling

* build: regenerate host env security policy swift

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-03-02 01:11:05 +00:00
Aleksandrs Tihenko
c0026274d9
fix(auth): distinguish revoked API keys from transient auth errors (#25754)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 8f9c07a200644284e11adae76368adab40c5fa4e
Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-25 19:47:16 -05:00
taw0002
3c57bf4c85
fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017)
* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes #20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 03:01:57 -05:00
mudrii
7ecfc1d93c
fix(auth): bidirectional mode/type compat + sync OAuth to all agents (#12692)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 2dee8e1174e637e50d10bf7020f1de2990b804dc
Co-authored-by: mudrii <220262+mudrii@users.noreply.github.com>
Co-authored-by: obviyus <22031114+obviyus@users.noreply.github.com>
Reviewed-by: @obviyus
2026-02-20 16:01:09 +05:30
Protocol Zero
2af3415fac
fix: treat HTTP 503 as failover-eligible for LLM provider errors (#21086)
* fix: treat HTTP 503 as failover-eligible for LLM provider errors

When LLM SDKs wrap 503 responses, the leading "503" prefix is lost
(e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a
numeric prefix). The existing isTransientHttpError only matches
messages starting with "503 ...", so these wrapped errors silently
skip failover — no profile rotation, no model fallback.

This patch closes that gap:

- resolveFailoverReasonFromError: map HTTP status 503 → rate_limit
  (covers structured error objects with a status field)
- ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable",
  "high demand" (covers message-only classification when the leading
  status prefix is absent)

Existing isTransientHttpError behavior is unchanged; these additions
are complementary and only fire for errors that previously fell
through unclassified.

* fix: address review feedback — drop /\b503\b/ pattern, add test coverage

- Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the
  semantic inconsistency noted by reviewers: `isTransientHttpError`
  already handles messages prefixed with "503" (→ "timeout"), so a
  redundant overloaded pattern would classify the same class of errors
  differently depending on message formatting.

- Keep "service unavailable" and "high demand" patterns — these are the
  real gap-fillers for SDK-rewritten messages that lack a numeric prefix.

- Add test case for JSON-wrapped 503 error body containing "overloaded"
  to strengthen coverage.

* fix: unify 503 classification — status 503 → timeout (consistent with isTransientHttpError)

resolveFailoverReasonFromError previously mapped status 503 → "rate_limit",
while the string-based isTransientHttpError mapped "503 ..." → "timeout".

Align both paths: structured {status: 503} now also returns "timeout",
matching the existing transient-error convention. Both reasons are
failover-eligible, so runtime behavior is unchanged.

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-19 12:45:09 -08:00
Sebastian
fbda9a93fd fix(failover): align abort timeout detection and regressions 2026-02-16 21:00:27 -05:00
Daniel Sauer
12ce358da5 fix(failover): recognize 'abort' stop reason as timeout for model fallback
When streaming providers (GLM, OpenRouter, etc.) return 'stop reason: abort'
due to stream interruption, OpenClaw's failover mechanism did not recognize
this as a timeout condition. This prevented fallback models from being
triggered, leaving users with failed requests instead of graceful failover.

Changes:
- Add abort patterns to ERROR_PATTERNS.timeout in pi-embedded-helpers/errors.ts
- Extend TIMEOUT_HINT_RE regex to include abort patterns in failover-error.ts

Fixes #18453

Co-authored-by: James <james@openclaw.ai>
2026-02-16 23:49:51 +01:00
Oren
71b4be8799
fix: handle 400 status in failover to enable model fallback (#1879) 2026-02-08 23:12:06 -08:00
cpojer
5ceff756e1
chore: Enable "curly" rule to avoid single-statement if confusion/errors. 2026-01-31 16:19:20 +09:00
Luke
be1cdc9370
fix(agents): treat provider request-aborted as timeout for fallback (#1576)
* fix(agents): treat request-aborted as timeout for fallback

* test(e2e): add provider timeout fallback
2026-01-24 11:27:24 +00:00
Peter Steinberger
ec27c813cc fix(fallback): handle timeout aborts
Co-authored-by: Mykyta Bozhenko <21245729+cheeeee@users.noreply.github.com>
2026-01-18 07:52:44 +00:00
Peter Steinberger
c379191f80 chore: migrate to oxlint and oxfmt
Co-authored-by: Christoph Nakazawa <christoph.pojer@gmail.com>
2026-01-14 15:02:19 +00:00
Peter Steinberger
53ec8e36cb refactor: centralize failover error parsing 2026-01-10 01:26:06 +01:00
Peter Steinberger
402c35b91c refactor(agents): centralize failover normalization 2026-01-09 22:15:06 +01:00
Peter Steinberger
c27b1441f7 fix(auth): billing backoff + cooldown UX 2026-01-09 22:00:14 +01:00