docs(plugins): document capability ownership model

Peter Steinberger 2026-03-16 18:50:03 -07:00
parent 662031a88e
commit 6da9ba3267
4 changed files with 238 additions and 41 deletions

View File

@@ -204,7 +204,7 @@ Example with a stable public host:

## TTS for calls

Voice Call uses the core `messages.tts` configuration for
streaming speech on calls. You can override it under the plugin config with the
**same shape** — it deepmerges with `messages.tts`.

@@ -222,7 +222,7 @@ streaming speech on calls. You can override it under the plugin config with the

Notes:

- **Microsoft speech is ignored for voice calls** (telephony audio needs PCM; the current Microsoft transport does not expose telephony PCM output).
- Core TTS is used when Twilio media streaming is enabled; otherwise calls fall back to provider native voices.

### More examples

View File

@@ -97,6 +97,76 @@ The important design boundary:

That split lets OpenClaw validate config, explain missing/disabled plugins, and
build UI/schema hints before the full runtime is active.
## Capability ownership model

OpenClaw treats a native plugin as the ownership boundary for a **company** or a
**feature**, not as a grab bag of unrelated integrations.

That means:

- a company plugin should usually own all of that company's OpenClaw-facing
  surfaces
- a feature plugin should usually own the full feature surface it introduces
- channels should consume shared core capabilities instead of re-implementing
  provider behavior ad hoc

Examples:

- the bundled `openai` plugin owns OpenAI model-provider behavior and OpenAI
  speech behavior
- the bundled `elevenlabs` plugin owns ElevenLabs speech behavior
- the bundled `microsoft` plugin owns Microsoft speech behavior
- the `voice-call` plugin is a feature plugin: it owns call transport, tools,
  CLI, routes, and runtime, but it consumes core TTS/STT capability instead of
  inventing a second speech stack

The intended end state is:

- OpenAI lives in one plugin even if it spans text models, speech, images, and
  future video
- another vendor can do the same for its own surface area
- channels do not care which vendor plugin owns the provider; they consume the
  shared capability contract exposed by core

This is the key distinction:

- **plugin** = ownership boundary
- **capability** = core contract that multiple plugins can implement or consume

So if OpenClaw adds a new domain such as video, the first question is not
"which provider should hardcode video handling?" The first question is "what is
the core video capability contract?" Once that contract exists, vendor plugins
can register against it and channel/feature plugins can consume it.

If the capability does not exist yet, the right move is usually:

1. define the missing capability in core
2. expose it through the plugin API/runtime in a typed way
3. wire channels/features against that capability
4. let vendor plugins register implementations

This keeps ownership explicit while avoiding core behavior that depends on a
single vendor or a one-off plugin-specific code path.

### Capability layering

Use this mental model when deciding where code belongs:

- **core capability layer**: shared orchestration, policy, fallback, config
  merge rules, delivery semantics, and typed contracts
- **vendor plugin layer**: vendor-specific APIs, auth, model catalogs, speech
  synthesis, image generation, future video backends, usage endpoints
- **channel/feature plugin layer**: Slack/Discord/voice-call/etc. integration
  that consumes core capabilities and presents them on a surface

For example, TTS follows this shape:

- core owns reply-time TTS policy, fallback order, prefs, and channel delivery
- `openai`, `elevenlabs`, and `microsoft` own synthesis implementations
- `voice-call` consumes the telephony TTS runtime helper

That same pattern should be preferred for future capabilities.
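To make the layering concrete, here is a minimal TypeScript sketch. All names here (`SpeechSynthesisCapability`, `registerSpeechCapability`, the `acme` provider) are illustrative, not the actual OpenClaw API:

```typescript
// Core capability layer: a typed contract owned by core (hypothetical shape).
interface SpeechSynthesisCapability {
  id: string;
  synthesize(text: string): Promise<{ audio: Uint8Array; format: string }>;
}

// Core owns the registry; vendors never see each other.
const speechProviders = new Map<string, SpeechSynthesisCapability>();

function registerSpeechCapability(impl: SpeechSynthesisCapability): void {
  speechProviders.set(impl.id, impl);
}

// Vendor plugin layer: a vendor registers against the core contract.
registerSpeechCapability({
  id: "acme",
  synthesize: async (text) => ({
    audio: new TextEncoder().encode(text), // stand-in for real synthesis
    format: "mp3",
  }),
});

// Channel/feature plugin layer: consumers resolve by capability id,
// with no knowledge of which vendor plugin owns the implementation.
async function speakReply(text: string, preferred: string[]): Promise<string> {
  for (const id of preferred) {
    const provider = speechProviders.get(id);
    if (provider) return (await provider.synthesize(text)).format;
  }
  throw new Error("no speech provider available");
}
```

The point of the sketch is that `speakReply` (a channel concern) only ever touches the core contract, so swapping or adding vendors never changes channel code.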
## Compatible bundles

OpenClaw also recognizes two compatible external bundle layouts:

@@ -193,6 +263,8 @@ Important trust note:

- Model Studio provider catalog — bundled as `modelstudio` (enabled by default)
- Moonshot provider runtime — bundled as `moonshot` (enabled by default)
- NVIDIA provider catalog — bundled as `nvidia` (enabled by default)
- ElevenLabs speech provider — bundled as `elevenlabs` (enabled by default)
- Microsoft speech provider — bundled as `microsoft` (enabled by default; legacy `edge` input maps here)
- OpenAI provider runtime — bundled as `openai` (enabled by default; owns both `openai` and `openai-codex`)
- OpenCode Go provider capabilities — bundled as `opencode-go` (enabled by default)
- OpenCode Zen provider capabilities — bundled as `opencode` (enabled by default)

@@ -218,6 +290,8 @@ Native OpenClaw plugins can register:

- Gateway HTTP routes
- Agent tools
- CLI commands
- Speech providers
- Web search providers
- Background services
- Context engines
- Provider auth flows and model catalogs

@@ -229,6 +303,62 @@ Native OpenClaw plugins can register:

Native OpenClaw plugins run **in-process** with the Gateway, so treat them as trusted code.

Tool authoring guide: [Plugin agent tools](/plugins/agent-tools).
Think of these registrations as **capability claims**. A plugin is not supposed
to reach into random internals and "just make it work." It should register
against explicit surfaces that OpenClaw understands, validates, and can expose
consistently across config, onboarding, status, docs, and runtime behavior.

## Contracts and enforcement

The plugin API surface is intentionally typed and centralized in
`OpenClawPluginApi`. That contract defines the supported registration points and
the runtime helpers a plugin may rely on.

Why this matters:

- plugin authors get one stable internal standard
- core can reject duplicate ownership such as two plugins registering the same
  provider id
- startup can surface actionable diagnostics for malformed registration
- contract tests can enforce bundled-plugin ownership and prevent silent drift

There are two layers of enforcement:

1. **runtime registration enforcement**
   The plugin registry validates registrations as plugins load. Examples:
   duplicate provider ids, duplicate speech provider ids, and malformed
   registrations produce plugin diagnostics instead of undefined behavior.
2. **contract tests**
   Bundled plugins are captured in contract registries during test runs so
   OpenClaw can assert ownership explicitly. Today this is used for model
   providers, web search providers, and bundled registration ownership.

The practical effect is that OpenClaw knows, up front, which plugin owns which
surface. That lets core and channels compose seamlessly because ownership is
declared, typed, and testable rather than implicit.
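A minimal sketch of what duplicate-ownership rejection can look like. The names (`PluginRegistry`, `Diagnostic`) are hypothetical; the real registry lives in core:

```typescript
// Hypothetical sketch: duplicate registrations become diagnostics,
// not undefined behavior.
type Diagnostic = { plugin: string; message: string };

class PluginRegistry {
  private providerOwners = new Map<string, string>();
  readonly diagnostics: Diagnostic[] = [];

  // Returns true if the claim succeeded, false if it produced a diagnostic.
  registerProvider(plugin: string, providerId: string): boolean {
    const owner = this.providerOwners.get(providerId);
    if (owner) {
      this.diagnostics.push({
        plugin,
        message: `provider "${providerId}" is already owned by plugin "${owner}"`,
      });
      return false;
    }
    this.providerOwners.set(providerId, plugin);
    return true;
  }
}
```

The first plugin to claim an id wins; later claims surface as actionable diagnostics instead of silently overwriting ownership.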
### What belongs in a contract

Good plugin contracts are:

- typed
- small
- capability-specific
- owned by core
- reusable by multiple plugins
- consumable by channels/features without vendor knowledge

Bad plugin contracts are:

- vendor-specific policy hidden in core
- one-off plugin escape hatches that bypass the registry
- channel code reaching straight into a vendor implementation
- ad hoc runtime objects that are not part of `OpenClawPluginApi` or
  `api.runtime`

When in doubt, raise the abstraction level: define the capability first, then
let plugins plug into it.
## Provider runtime hooks

Provider plugins now have two layers:

@@ -530,9 +660,36 @@ const result = await api.runtime.tts.textToSpeechTelephony({

Notes:

- Uses core `messages.tts` configuration and provider selection.
- Returns PCM audio buffer + sample rate. Plugins must resample/encode for providers.
- OpenAI and ElevenLabs support telephony today. Microsoft does not.
Plugins can also register speech providers via `api.registerSpeechProvider(...)`.
```ts
api.registerSpeechProvider({
  // Unique speech provider id; the registry rejects duplicate ids at load.
  id: "acme-speech",
  label: "Acme Speech",
  // Report whether this provider is usable with the current config.
  isConfigured: ({ config }) => Boolean(config.messages?.tts),
  // Return synthesized audio plus the metadata core needs to deliver it.
  synthesize: async (req) => {
    return {
      audioBuffer: Buffer.from([]),
      outputFormat: "mp3",
      fileExtension: ".mp3",
      voiceCompatible: false,
    };
  },
});
```
Notes:
- Keep TTS policy, fallback, and reply delivery in core.
- Use speech providers for vendor-owned synthesis behavior.
- Legacy Microsoft `edge` input is normalized to the `microsoft` provider id.
- The preferred ownership model is company-oriented: one vendor plugin can own
text, speech, image, and future media providers as OpenClaw adds those
capability contracts.
For STT/transcription, plugins can call:

@@ -1110,12 +1267,49 @@ Plugins export either:

- `on(...)` for typed lifecycle hooks
- `registerChannel`
- `registerProvider`
- `registerSpeechProvider`
- `registerWebSearchProvider`
- `registerHttpRoute`
- `registerCommand`
- `registerCli`
- `registerContextEngine`
- `registerService`
In practice, `register(api)` is also where a plugin declares **ownership**.
That ownership should map cleanly to either:

- a vendor surface such as OpenAI, ElevenLabs, or Microsoft
- a feature surface such as Voice Call

Avoid splitting one vendor's capabilities across unrelated plugins unless there
is a strong product reason to do so. The default should be one plugin per
vendor/feature, with core capability contracts separating shared orchestration
from vendor-specific behavior.
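As a sketch, a vendor plugin's `register(api)` can claim the vendor's whole surface in one place. The `acme` ids and the trimmed-down `PluginApi` shape here are illustrative, not the real `OpenClawPluginApi`:

```typescript
// Hypothetical, trimmed-down registration API for illustration only.
interface PluginApi {
  registerProvider(p: { id: string }): void;
  registerSpeechProvider(p: { id: string; label: string }): void;
}

// One vendor plugin owns the model-provider surface AND the speech surface,
// instead of scattering them across unrelated plugins.
function register(api: PluginApi): void {
  api.registerProvider({ id: "acme" });
  api.registerSpeechProvider({ id: "acme-speech", label: "Acme Speech" });
}
```

If the vendor later gains image or video backends, they register here too, against whatever capability contracts core exposes by then.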
## Adding a new capability

When a plugin needs behavior that does not fit the current API, do not bypass
the plugin system with a private reach-in. Add the missing capability.

Recommended sequence:

1. define the core contract
   Decide what shared behavior core should own: policy, fallback, config merge,
   lifecycle, channel-facing semantics, and runtime helper shape.
2. add typed plugin registration/runtime surfaces
   Extend `OpenClawPluginApi` and/or `api.runtime` with the smallest useful
   typed seam.
3. wire core + channel/feature consumers
   Channels and feature plugins should consume the new capability through core,
   not by importing a vendor implementation directly.
4. register vendor implementations
   Vendor plugins then register their backends against the capability.
5. add contract coverage
   Add tests so ownership and registration shape stay explicit over time.

This is how OpenClaw stays opinionated without becoming hardcoded to one
provider's worldview.
Context engine plugins can also register a runtime-owned context manager:

```ts

View File

@@ -9,26 +9,27 @@ title: "Text-to-Speech"

# Text-to-speech (TTS)

OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, or OpenAI.
It works anywhere OpenClaw can send audio; Telegram gets a round voice-note bubble.

## Supported services

- **ElevenLabs** (primary or fallback provider)
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`, default when no API keys)
- **OpenAI** (primary or fallback provider; also used for summaries)

### Microsoft speech notes

The bundled Microsoft speech provider currently uses Microsoft Edge's online
neural TTS service via the `node-edge-tts` library. It's a hosted service (not
local), uses Microsoft endpoints, and does not require an API key.
`node-edge-tts` exposes speech configuration options and output formats, but
not all options are supported by the service. Legacy config and directive input
using `edge` still works and is normalized to `microsoft`.

Because this path is a public web service without a published SLA or quota,
treat it as best-effort. If you need guaranteed limits and support, use OpenAI
or ElevenLabs.

## Optional keys

@@ -37,8 +38,9 @@ If you want OpenAI or ElevenLabs:

- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `OPENAI_API_KEY`

Microsoft speech does **not** require an API key. If no API keys are found,
OpenClaw defaults to Microsoft (unless disabled via
`messages.tts.microsoft.enabled=false` or `messages.tts.edge.enabled=false`).

If multiple providers are configured, the selected provider is used first and the others are fallback options.

Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),

@@ -58,7 +60,7 @@ so that provider must also be authenticated if you enable summaries.

No. AutoTTS is **off** by default. Enable it in config with
`messages.tts.auto` or per session with `/tts always` (alias: `/tts on`).

Microsoft speech **is** enabled by default once TTS is on, and is used automatically
when no OpenAI or ElevenLabs API keys are available.
## Config

@@ -118,15 +120,15 @@ Full schema is in [Gateway configuration](/gateway/configuration).

}
```

### Microsoft primary (no API key)

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      microsoft: {
        enabled: true,
        voice: "en-US-MichelleNeural",
        lang: "en-US",

@@ -139,13 +141,13 @@ Full schema is in [Gateway configuration](/gateway/configuration).

}
```

### Disable Microsoft speech

```json5
{
  messages: {
    tts: {
      microsoft: {
        enabled: false,
      },
    },
@@ -205,9 +207,10 @@ Then run:

- `tagged` only sends audio when the reply includes `[[tts]]` tags.
- `enabled`: legacy toggle (doctor migrates this to `auto`).
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, or `"openai"` (fallback is automatic).
- If `provider` is **unset**, OpenClaw prefers `openai` (if key), then `elevenlabs` (if key),
  otherwise `microsoft`.
- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
  - Accepts `provider/model` or a configured model alias.
- `modelOverrides`: allow the model to emit TTS directives (on by default).
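The default selection order described above can be sketched as a small helper (hypothetical function, not the actual OpenClaw code):

```typescript
// Hypothetical sketch of default TTS provider selection:
// explicit config wins; otherwise openai (if key), elevenlabs (if key),
// then the keyless microsoft default.
type SpeechProviderId = "openai" | "elevenlabs" | "microsoft";

function resolveTtsProvider(opts: {
  configured?: SpeechProviderId;
  hasOpenAiKey: boolean;
  hasElevenLabsKey: boolean;
}): SpeechProviderId {
  if (opts.configured) return opts.configured;
  if (opts.hasOpenAiKey) return "openai";
  if (opts.hasElevenLabsKey) return "elevenlabs";
  return "microsoft";
}
```

Providers that are not selected remain available as automatic fallbacks.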
@@ -227,15 +230,16 @@ Then run:

- `elevenlabs.applyTextNormalization`: `auto|on|off`
- `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
- `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
- `microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
- `microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
- `microsoft.lang`: language code (e.g. `en-US`).
- `microsoft.outputFormat`: Microsoft output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
  - See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport.
- `microsoft.rate` / `microsoft.pitch` / `microsoft.volume`: percent strings (e.g. `+10%`, `-5%`).
- `microsoft.saveSubtitles`: write JSON subtitles alongside the audio file.
- `microsoft.proxy`: proxy URL for Microsoft speech requests.
- `microsoft.timeoutMs`: request timeout override (ms).
- `edge.*`: legacy alias for the same Microsoft settings.
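A rough sketch of how the legacy `edge` alias can fold into `microsoft` settings (illustrative helpers, not the actual normalization code):

```typescript
// Hypothetical sketch: legacy `edge` ids and config keys map onto `microsoft`.
function normalizeSpeechProviderId(id: string): string {
  return id === "edge" ? "microsoft" : id;
}

function normalizeTtsConfig(tts: Record<string, unknown>): Record<string, unknown> {
  const { edge, microsoft, ...rest } = tts as {
    edge?: Record<string, unknown>;
    microsoft?: Record<string, unknown>;
    [key: string]: unknown;
  };
  if (!edge) return tts;
  // Explicit `microsoft` settings win; legacy `edge` values fill the gaps.
  return { ...rest, microsoft: { ...edge, ...microsoft } };
}
```

This mirrors the documented behavior: legacy input keeps working, but `microsoft` is the canonical key going forward.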
## Model-driven overrides (default on)

@@ -260,7 +264,7 @@ Here you go.

Available directive keys (when enabled):

- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, or `microsoft`; requires `allowProvider: true`)
- `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
- `model` (OpenAI TTS model or ElevenLabs model id)
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
@@ -319,13 +323,12 @@ These override `messages.tts.*` for that host.

- 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
  - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
  - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
    guaranteed Opus voice notes.
  - If the configured Microsoft output format fails, OpenClaw retries with MP3.

OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for voice-note UX.
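The MP3 retry behavior for Microsoft output formats can be sketched as a wrapper (hypothetical helper shape, not the actual implementation):

```typescript
// Hypothetical sketch: try the configured Microsoft output format once,
// then fall back to the MP3 default if the service rejects it.
async function synthesizeWithFormatFallback(
  synthesize: (outputFormat: string) => Promise<Uint8Array>,
  configuredFormat: string,
): Promise<{ audio: Uint8Array; outputFormat: string }> {
  try {
    return { audio: await synthesize(configuredFormat), outputFormat: configuredFormat };
  } catch {
    const mp3 = "audio-24khz-48kbitrate-mono-mp3";
    return { audio: await synthesize(mp3), outputFormat: mp3 };
  }
}
```

The retry keeps TTS best-effort: a misconfigured or unsupported format degrades to MP3 instead of dropping the reply audio.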

View File

@@ -98,7 +98,7 @@ See the plugin docs for recommended ranges and production examples:

## TTS for calls

Voice Call uses the core `messages.tts` configuration for
streaming speech on calls. Override examples and provider caveats live here:
`https://docs.openclaw.ai/plugins/voice-call#tts-for-calls` `https://docs.openclaw.ai/plugins/voice-call#tts-for-calls`