docs(plugins): document capability ownership model
Parent: 662031a88e
Commit: 6da9ba3267
@@ -204,7 +204,7 @@ Example with a stable public host:
 
 ## TTS for calls
 
-Voice Call uses the core `messages.tts` configuration (OpenAI or ElevenLabs) for
+Voice Call uses the core `messages.tts` configuration for
 streaming speech on calls. You can override it under the plugin config with the
 **same shape** — it deep‑merges with `messages.tts`.
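The override semantics above amount to a recursive object merge. As an illustrative sketch only (a hypothetical `deepMerge` helper, not OpenClaw's actual implementation):

```typescript
// Illustrative sketch: how a plugin-level TTS override could deep-merge over
// the core `messages.tts` config. Names and shapes are assumptions.
type Cfg = { [key: string]: unknown };

function deepMerge(base: Cfg, override: Cfg): Cfg {
  const out: Cfg = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const prev = out[key];
    if (
      value && typeof value === "object" && !Array.isArray(value) &&
      prev && typeof prev === "object" && !Array.isArray(prev)
    ) {
      // Nested objects merge key by key.
      out[key] = deepMerge(prev as Cfg, value as Cfg);
    } else {
      // Scalars and arrays replace wholesale.
      out[key] = value;
    }
  }
  return out;
}

const coreTts = { provider: "openai", openai: { voice: "alloy", model: "tts-1" } };
const callOverride = { openai: { voice: "verse" } };
// Result keeps provider and model, but the call override wins on voice.
console.log(deepMerge(coreTts, callOverride));
```

Plugin keys win per field while untouched core keys survive, which is why the override can use the same shape as `messages.tts`.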
@@ -222,7 +222,7 @@ streaming speech on calls. You can override it under the plugin config with the
 
 Notes:
 
-- **Edge TTS is ignored for voice calls** (telephony audio needs PCM; Edge output is unreliable).
+- **Microsoft speech is ignored for voice calls** (telephony audio needs PCM; the current Microsoft transport does not expose telephony PCM output).
 - Core TTS is used when Twilio media streaming is enabled; otherwise calls fall back to provider native voices.
 
 ### More examples
@@ -97,6 +97,76 @@ The important design boundary:
 That split lets OpenClaw validate config, explain missing/disabled plugins, and
 build UI/schema hints before the full runtime is active.
+
+## Capability ownership model
+
+OpenClaw treats a native plugin as the ownership boundary for a **company** or a
+**feature**, not as a grab bag of unrelated integrations.
+
+That means:
+
+- a company plugin should usually own all of that company's OpenClaw-facing
+  surfaces
+- a feature plugin should usually own the full feature surface it introduces
+- channels should consume shared core capabilities instead of re-implementing
+  provider behavior ad hoc
+
+Examples:
+
+- the bundled `openai` plugin owns OpenAI model-provider behavior and OpenAI
+  speech behavior
+- the bundled `elevenlabs` plugin owns ElevenLabs speech behavior
+- the bundled `microsoft` plugin owns Microsoft speech behavior
+- the `voice-call` plugin is a feature plugin: it owns call transport, tools,
+  CLI, routes, and runtime, but it consumes core TTS/STT capability instead of
+  inventing a second speech stack
+
+The intended end state is:
+
+- OpenAI lives in one plugin even if it spans text models, speech, images, and
+  future video
+- another vendor can do the same for its own surface area
+- channels do not care which vendor plugin owns the provider; they consume the
+  shared capability contract exposed by core
+
+This is the key distinction:
+
+- **plugin** = ownership boundary
+- **capability** = core contract that multiple plugins can implement or consume
+
+So if OpenClaw adds a new domain such as video, the first question is not
+"which provider should hardcode video handling?" The first question is "what is
+the core video capability contract?" Once that contract exists, vendor plugins
+can register against it and channel/feature plugins can consume it.
+
+If the capability does not exist yet, the right move is usually:
+
+1. define the missing capability in core
+2. expose it through the plugin API/runtime in a typed way
+3. wire channels/features against that capability
+4. let vendor plugins register implementations
+
+This keeps ownership explicit while avoiding core behavior that depends on a
+single vendor or a one-off plugin-specific code path.
+
+### Capability layering
+
+Use this mental model when deciding where code belongs:
+
+- **core capability layer**: shared orchestration, policy, fallback, config
+  merge rules, delivery semantics, and typed contracts
+- **vendor plugin layer**: vendor-specific APIs, auth, model catalogs, speech
+  synthesis, image generation, future video backends, usage endpoints
+- **channel/feature plugin layer**: Slack/Discord/voice-call/etc. integration
+  that consumes core capabilities and presents them on a surface
+
+For example, TTS follows this shape:
+
+- core owns reply-time TTS policy, fallback order, prefs, and channel delivery
+- `openai`, `elevenlabs`, and `microsoft` own synthesis implementations
+- `voice-call` consumes the telephony TTS runtime helper
+
+That same pattern should be preferred for future capabilities.
 
 ## Compatible bundles
 
 OpenClaw also recognizes two compatible external bundle layouts:
@@ -193,6 +263,8 @@ Important trust note:
 - Model Studio provider catalog — bundled as `modelstudio` (enabled by default)
 - Moonshot provider runtime — bundled as `moonshot` (enabled by default)
 - NVIDIA provider catalog — bundled as `nvidia` (enabled by default)
+- ElevenLabs speech provider — bundled as `elevenlabs` (enabled by default)
+- Microsoft speech provider — bundled as `microsoft` (enabled by default; legacy `edge` input maps here)
 - OpenAI provider runtime — bundled as `openai` (enabled by default; owns both `openai` and `openai-codex`)
 - OpenCode Go provider capabilities — bundled as `opencode-go` (enabled by default)
 - OpenCode Zen provider capabilities — bundled as `opencode` (enabled by default)
@@ -218,6 +290,8 @@ Native OpenClaw plugins can register:
 - Gateway HTTP routes
 - Agent tools
 - CLI commands
+- Speech providers
+- Web search providers
 - Background services
 - Context engines
 - Provider auth flows and model catalogs
@@ -229,6 +303,62 @@ Native OpenClaw plugins can register:
 Native OpenClaw plugins run **in‑process** with the Gateway, so treat them as trusted code.
 Tool authoring guide: [Plugin agent tools](/plugins/agent-tools).
+
+Think of these registrations as **capability claims**. A plugin is not supposed
+to reach into random internals and "just make it work." It should register
+against explicit surfaces that OpenClaw understands, validates, and can expose
+consistently across config, onboarding, status, docs, and runtime behavior.
+
+## Contracts and enforcement
+
+The plugin API surface is intentionally typed and centralized in
+`OpenClawPluginApi`. That contract defines the supported registration points and
+the runtime helpers a plugin may rely on.
+
+Why this matters:
+
+- plugin authors get one stable internal standard
+- core can reject duplicate ownership such as two plugins registering the same
+  provider id
+- startup can surface actionable diagnostics for malformed registration
+- contract tests can enforce bundled-plugin ownership and prevent silent drift
+
+There are two layers of enforcement:
+
+1. **runtime registration enforcement**
+   The plugin registry validates registrations as plugins load. Examples:
+   duplicate provider ids, duplicate speech provider ids, and malformed
+   registrations produce plugin diagnostics instead of undefined behavior.
+2. **contract tests**
+   Bundled plugins are captured in contract registries during test runs so
+   OpenClaw can assert ownership explicitly. Today this is used for model
+   providers, web search providers, and bundled registration ownership.
+
+The practical effect is that OpenClaw knows, up front, which plugin owns which
+surface. That lets core and channels compose seamlessly because ownership is
+declared, typed, and testable rather than implicit.
+
+### What belongs in a contract
+
+Good plugin contracts are:
+
+- typed
+- small
+- capability-specific
+- owned by core
+- reusable by multiple plugins
+- consumable by channels/features without vendor knowledge
+
+Bad plugin contracts are:
+
+- vendor-specific policy hidden in core
+- one-off plugin escape hatches that bypass the registry
+- channel code reaching straight into a vendor implementation
+- ad hoc runtime objects that are not part of `OpenClawPluginApi` or
+  `api.runtime`
+
+When in doubt, raise the abstraction level: define the capability first, then
+let plugins plug into it.
 
 ## Provider runtime hooks
 
 Provider plugins now have two layers:
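To make the enforcement idea above concrete, here is a hypothetical sketch of a registry that rejects duplicate ownership at load time. `SpeechProvider` and `CapabilityRegistry` are illustrative names for this sketch, not the actual `OpenClawPluginApi` types:

```typescript
// Hypothetical sketch of a capability registry enforcing single ownership.
interface SpeechProvider {
  id: string; // unique provider id, e.g. "elevenlabs"
  synthesize(text: string): Promise<Uint8Array>;
}

class CapabilityRegistry {
  private providers = new Map<string, SpeechProvider>();
  readonly diagnostics: string[] = [];

  register(pluginId: string, provider: SpeechProvider): void {
    if (this.providers.has(provider.id)) {
      // Duplicate ownership becomes a diagnostic, not undefined behavior.
      this.diagnostics.push(
        `plugin ${pluginId}: speech provider id "${provider.id}" already registered`,
      );
      return;
    }
    this.providers.set(provider.id, provider);
  }

  get(id: string): SpeechProvider | undefined {
    return this.providers.get(id);
  }
}
```

Core owns the registry and the contract; vendor plugins only supply implementations keyed by id, which is what makes ownership testable.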
@@ -530,9 +660,36 @@ const result = await api.runtime.tts.textToSpeechTelephony({
 
 Notes:
 
-- Uses core `messages.tts` configuration (OpenAI or ElevenLabs).
+- Uses core `messages.tts` configuration and provider selection.
 - Returns PCM audio buffer + sample rate. Plugins must resample/encode for providers.
-- Edge TTS is not supported for telephony.
+- OpenAI and ElevenLabs support telephony today. Microsoft does not.
+
+Plugins can also register speech providers via `api.registerSpeechProvider(...)`.
+
+```ts
+api.registerSpeechProvider({
+  id: "acme-speech",
+  label: "Acme Speech",
+  isConfigured: ({ config }) => Boolean(config.messages?.tts),
+  synthesize: async (req) => {
+    return {
+      audioBuffer: Buffer.from([]),
+      outputFormat: "mp3",
+      fileExtension: ".mp3",
+      voiceCompatible: false,
+    };
+  },
+});
+```
+
+Notes:
+
+- Keep TTS policy, fallback, and reply delivery in core.
+- Use speech providers for vendor-owned synthesis behavior.
+- Legacy Microsoft `edge` input is normalized to the `microsoft` provider id.
+- The preferred ownership model is company-oriented: one vendor plugin can own
+  text, speech, image, and future media providers as OpenClaw adds those
+  capability contracts.
 
 For STT/transcription, plugins can call:
@@ -1110,12 +1267,49 @@ Plugins export either:
 
 - `on(...)` for typed lifecycle hooks
 - `registerChannel`
 - `registerProvider`
+- `registerSpeechProvider`
+- `registerWebSearchProvider`
 - `registerHttpRoute`
 - `registerCommand`
 - `registerCli`
 - `registerContextEngine`
 - `registerService`
+
+In practice, `register(api)` is also where a plugin declares **ownership**.
+That ownership should map cleanly to either:
+
+- a vendor surface such as OpenAI, ElevenLabs, or Microsoft
+- a feature surface such as Voice Call
+
+Avoid splitting one vendor's capabilities across unrelated plugins unless there
+is a strong product reason to do so. The default should be one plugin per
+vendor/feature, with core capability contracts separating shared orchestration
+from vendor-specific behavior.
+
+## Adding a new capability
+
+When a plugin needs behavior that does not fit the current API, do not bypass
+the plugin system with a private reach-in. Add the missing capability.
+
+Recommended sequence:
+
+1. define the core contract
+   Decide what shared behavior core should own: policy, fallback, config merge,
+   lifecycle, channel-facing semantics, and runtime helper shape.
+2. add typed plugin registration/runtime surfaces
+   Extend `OpenClawPluginApi` and/or `api.runtime` with the smallest useful
+   typed seam.
+3. wire core + channel/feature consumers
+   Channels and feature plugins should consume the new capability through core,
+   not by importing a vendor implementation directly.
+4. register vendor implementations
+   Vendor plugins then register their backends against the capability.
+5. add contract coverage
+   Add tests so ownership and registration shape stay explicit over time.
+
+This is how OpenClaw stays opinionated without becoming hardcoded to one
+provider's worldview.
 
 Context engine plugins can also register a runtime-owned context manager:
 
 ```ts
docs/tts.md (75 lines changed)
@@ -9,26 +9,27 @@ title: "Text-to-Speech"
 
 # Text-to-speech (TTS)
 
-OpenClaw can convert outbound replies into audio using ElevenLabs, OpenAI, or Edge TTS.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, or OpenAI.
 It works anywhere OpenClaw can send audio; Telegram gets a round voice-note bubble.
 
 ## Supported services
 
 - **ElevenLabs** (primary or fallback provider)
+- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`, default when no API keys)
 - **OpenAI** (primary or fallback provider; also used for summaries)
-- **Edge TTS** (primary or fallback provider; uses `node-edge-tts`, default when no API keys)
 
-### Edge TTS notes
+### Microsoft speech notes
 
-Edge TTS uses Microsoft Edge's online neural TTS service via the `node-edge-tts`
-library. It's a hosted service (not local), uses Microsoft’s endpoints, and does
-not require an API key. `node-edge-tts` exposes speech configuration options and
-output formats, but not all options are supported by the Edge service.
+The bundled Microsoft speech provider currently uses Microsoft Edge's online
+neural TTS service via the `node-edge-tts` library. It's a hosted service (not
+local), uses Microsoft endpoints, and does not require an API key.
+`node-edge-tts` exposes speech configuration options and output formats, but
+not all options are supported by the service. Legacy config and directive input
+using `edge` still works and is normalized to `microsoft`.
 
-Because Edge TTS is a public web service without a published SLA or quota, treat it
-as best-effort. If you need guaranteed limits and support, use OpenAI or ElevenLabs.
-Microsoft's Speech REST API documents a 10‑minute audio limit per request; Edge TTS
-does not publish limits, so assume similar or lower limits.
+Because this path is a public web service without a published SLA or quota,
+treat it as best-effort. If you need guaranteed limits and support, use OpenAI
+or ElevenLabs.
 
 ## Optional keys
@@ -37,8 +38,9 @@ If you want OpenAI or ElevenLabs:
 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
 - `OPENAI_API_KEY`
 
-Edge TTS does **not** require an API key. If no API keys are found, OpenClaw defaults
-to Edge TTS (unless disabled via `messages.tts.edge.enabled=false`).
+Microsoft speech does **not** require an API key. If no API keys are found,
+OpenClaw defaults to Microsoft (unless disabled via
+`messages.tts.microsoft.enabled=false` or `messages.tts.edge.enabled=false`).
 
 If multiple providers are configured, the selected provider is used first and the others are fallback options.
 Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
@@ -58,7 +60,7 @@ so that provider must also be authenticated if you enable summaries.
 No. Auto‑TTS is **off** by default. Enable it in config with
 `messages.tts.auto` or per session with `/tts always` (alias: `/tts on`).
 
-Edge TTS **is** enabled by default once TTS is on, and is used automatically
+Microsoft speech **is** enabled by default once TTS is on, and is used automatically
 when no OpenAI or ElevenLabs API keys are available.
 
 ## Config
@@ -118,15 +120,15 @@ Full schema is in [Gateway configuration](/gateway/configuration).
 }
 ```
 
-### Edge TTS primary (no API key)
+### Microsoft primary (no API key)
 
 ```json5
 {
   messages: {
     tts: {
       auto: "always",
-      provider: "edge",
-      edge: {
+      provider: "microsoft",
+      microsoft: {
         enabled: true,
         voice: "en-US-MichelleNeural",
         lang: "en-US",
@@ -139,13 +141,13 @@ Full schema is in [Gateway configuration](/gateway/configuration).
 }
 ```
 
-### Disable Edge TTS
+### Disable Microsoft speech
 
 ```json5
 {
   messages: {
     tts: {
-      edge: {
+      microsoft: {
         enabled: false,
       },
     },
@@ -205,9 +207,10 @@ Then run:
 - `tagged` only sends audio when the reply includes `[[tts]]` tags.
 - `enabled`: legacy toggle (doctor migrates this to `auto`).
 - `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: `"elevenlabs"`, `"openai"`, or `"edge"` (fallback is automatic).
+- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, or `"openai"` (fallback is automatic).
 - If `provider` is **unset**, OpenClaw prefers `openai` (if key), then `elevenlabs` (if key),
-  otherwise `edge`.
+  otherwise `microsoft`.
+- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
 - `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
   - Accepts `provider/model` or a configured model alias.
 - `modelOverrides`: allow the model to emit TTS directives (on by default).
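The unset-provider selection order documented above can be sketched as follows. `pickTtsProvider` is a hypothetical helper for illustration, not OpenClaw's actual code:

```typescript
// Illustrative sketch of the documented selection order when `provider` is
// unset: prefer openai (if key), then elevenlabs (if key), else microsoft.
function pickTtsProvider(opts: {
  configured?: string; // explicit messages.tts.provider, if set
  hasOpenAiKey: boolean;
  hasElevenLabsKey: boolean;
}): string {
  if (opts.configured) {
    // Legacy "edge" input is normalized to the "microsoft" provider id.
    return opts.configured === "edge" ? "microsoft" : opts.configured;
  }
  if (opts.hasOpenAiKey) return "openai";
  if (opts.hasElevenLabsKey) return "elevenlabs";
  return "microsoft"; // no API key required
}
```

An explicitly configured provider always wins; the preference order only applies when `provider` is unset.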
@@ -227,15 +230,16 @@ Then run:
 - `elevenlabs.applyTextNormalization`: `auto|on|off`
 - `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
 - `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
-- `edge.enabled`: allow Edge TTS usage (default `true`; no API key).
-- `edge.voice`: Edge neural voice name (e.g. `en-US-MichelleNeural`).
-- `edge.lang`: language code (e.g. `en-US`).
-- `edge.outputFormat`: Edge output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
-  - See Microsoft Speech output formats for valid values; not all formats are supported by Edge.
-- `edge.rate` / `edge.pitch` / `edge.volume`: percent strings (e.g. `+10%`, `-5%`).
-- `edge.saveSubtitles`: write JSON subtitles alongside the audio file.
-- `edge.proxy`: proxy URL for Edge TTS requests.
-- `edge.timeoutMs`: request timeout override (ms).
+- `microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
+- `microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
+- `microsoft.lang`: language code (e.g. `en-US`).
+- `microsoft.outputFormat`: Microsoft output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
+  - See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport.
+- `microsoft.rate` / `microsoft.pitch` / `microsoft.volume`: percent strings (e.g. `+10%`, `-5%`).
+- `microsoft.saveSubtitles`: write JSON subtitles alongside the audio file.
+- `microsoft.proxy`: proxy URL for Microsoft speech requests.
+- `microsoft.timeoutMs`: request timeout override (ms).
+- `edge.*`: legacy alias for the same Microsoft settings.
 
 ## Model-driven overrides (default on)
@@ -260,7 +264,7 @@ Here you go.
 
 Available directive keys (when enabled):
 
-- `provider` (`openai` | `elevenlabs` | `edge`, requires `allowProvider: true`)
+- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, or `microsoft`; requires `allowProvider: true`)
 - `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
 - `model` (OpenAI TTS model or ElevenLabs model id)
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
@@ -319,13 +323,12 @@ These override `messages.tts.*` for that host.
   - 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble.
 - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
   - 44.1kHz / 128kbps is the default balance for speech clarity.
-- **Edge TTS**: uses `edge.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
-  - `node-edge-tts` accepts an `outputFormat`, but not all formats are available
-    from the Edge service.
-  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
+- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
+  - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
+  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
 - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
   guaranteed Opus voice notes.
-- If the configured Edge output format fails, OpenClaw retries with MP3.
+- If the configured Microsoft output format fails, OpenClaw retries with MP3.
 
 OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for voice-note UX.
@@ -98,7 +98,7 @@ See the plugin docs for recommended ranges and production examples:
 
 ## TTS for calls
 
-Voice Call uses the core `messages.tts` configuration (OpenAI or ElevenLabs) for
+Voice Call uses the core `messages.tts` configuration for
 streaming speech on calls. Override examples and provider caveats live here:
 `https://docs.openclaw.ai/plugins/voice-call#tts-for-calls`