From 6da9ba3267b96685150adad46e403b3c2505400e Mon Sep 17 00:00:00 2001
From: Peter Steinberger
Date: Mon, 16 Mar 2026 18:50:03 -0700
Subject: [PATCH] docs(plugins): document capability ownership model

---
 docs/plugins/voice-call.md      |   4 +-
 docs/tools/plugin.md            | 198 +++++++++++++++++++++++++++++++-
 docs/tts.md                     |  75 ++++++------
 extensions/voice-call/README.md |   2 +-
 4 files changed, 238 insertions(+), 41 deletions(-)

diff --git a/docs/plugins/voice-call.md b/docs/plugins/voice-call.md
index 14198fdba36..531b6c48595 100644
--- a/docs/plugins/voice-call.md
+++ b/docs/plugins/voice-call.md
@@ -204,7 +204,7 @@ Example with a stable public host:

 ## TTS for calls

-Voice Call uses the core `messages.tts` configuration (OpenAI or ElevenLabs) for
+Voice Call uses the core `messages.tts` configuration for
 streaming speech on calls. You can override it under the plugin config with the
 **same shape** — it deep‑merges with `messages.tts`.

@@ -222,7 +222,7 @@ streaming speech on calls. You can override it under the plugin config with the

 Notes:

-- **Edge TTS is ignored for voice calls** (telephony audio needs PCM; Edge output is unreliable).
+- **Microsoft speech is ignored for voice calls** (telephony audio needs PCM; the current Microsoft transport does not expose telephony PCM output).
 - Core TTS is used when Twilio media streaming is enabled; otherwise calls fall back to provider native voices.

 ### More examples
diff --git a/docs/tools/plugin.md b/docs/tools/plugin.md
index ec0247c8d72..3e53c5e205e 100644
--- a/docs/tools/plugin.md
+++ b/docs/tools/plugin.md
@@ -97,6 +97,76 @@ The important design boundary:
 That split lets OpenClaw validate config, explain missing/disabled plugins, and
 build UI/schema hints before the full runtime is active.

+## Capability ownership model
+
+OpenClaw treats a native plugin as the ownership boundary for a **company** or a
+**feature**, not as a grab bag of unrelated integrations.
+
+That means:
+
+- a company plugin should usually own all of that company's OpenClaw-facing
+  surfaces
+- a feature plugin should usually own the full feature surface it introduces
+- channels should consume shared core capabilities instead of re-implementing
+  provider behavior ad hoc
+
+Examples:
+
+- the bundled `openai` plugin owns OpenAI model-provider behavior and OpenAI
+  speech behavior
+- the bundled `elevenlabs` plugin owns ElevenLabs speech behavior
+- the bundled `microsoft` plugin owns Microsoft speech behavior
+- the `voice-call` plugin is a feature plugin: it owns call transport, tools,
+  CLI, routes, and runtime, but it consumes core TTS/STT capability instead of
+  inventing a second speech stack
+
+The intended end state is:
+
+- OpenAI lives in one plugin even if it spans text models, speech, images, and
+  future video
+- another vendor can do the same for its own surface area
+- channels do not care which vendor plugin owns the provider; they consume the
+  shared capability contract exposed by core
+
+This is the key distinction:
+
+- **plugin** = ownership boundary
+- **capability** = core contract that multiple plugins can implement or consume
+
+So if OpenClaw adds a new domain such as video, the first question is not
+"which provider should hardcode video handling?" The first question is "what is
+the core video capability contract?" Once that contract exists, vendor plugins
+can register against it and channel/feature plugins can consume it.
+
+If the capability does not exist yet, the right move is usually:
+
+1. define the missing capability in core
+2. expose it through the plugin API/runtime in a typed way
+3. wire channels/features against that capability
+4. let vendor plugins register implementations
+
+This keeps ownership explicit while avoiding core behavior that depends on a
+single vendor or a one-off plugin-specific code path.
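
Editorial sketch (not part of the patch): the four steps above could look roughly like this in TypeScript. Every name here (`VideoProvider`, `VideoCapability`, the `acme` vendor) is hypothetical and chosen only for illustration; none of it is OpenClaw's actual API.

```typescript
// Hypothetical sketch: none of these names exist in OpenClaw.

// Step 1: core defines the capability contract.
interface VideoProvider {
  id: string;
  render(prompt: string): string;
}

// Step 2: core exposes a typed registration and lookup surface.
class VideoCapability {
  private readonly providers = new Map<string, VideoProvider>();

  register(provider: VideoProvider): void {
    this.providers.set(provider.id, provider);
  }

  // Step 3: channels and features call this and never import a vendor module.
  render(providerId: string, prompt: string): string {
    const provider = this.providers.get(providerId);
    if (!provider) {
      throw new Error(`unknown video provider: ${providerId}`);
    }
    return provider.render(prompt);
  }
}

// Step 4: a vendor plugin registers its implementation.
const video = new VideoCapability();
video.register({
  id: "acme",
  render: (prompt) => `acme-video(${prompt})`,
});

// A channel consumes the capability without vendor knowledge.
const clip = video.render("acme", "hello");
```

The channel only ever sees `VideoCapability`, so swapping or adding a vendor does not touch channel code.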
+
+### Capability layering
+
+Use this mental model when deciding where code belongs:
+
+- **core capability layer**: shared orchestration, policy, fallback, config
+  merge rules, delivery semantics, and typed contracts
+- **vendor plugin layer**: vendor-specific APIs, auth, model catalogs, speech
+  synthesis, image generation, future video backends, usage endpoints
+- **channel/feature plugin layer**: Slack/Discord/voice-call/etc. integration
+  that consumes core capabilities and presents them on a surface
+
+For example, TTS follows this shape:
+
+- core owns reply-time TTS policy, fallback order, prefs, and channel delivery
+- `openai`, `elevenlabs`, and `microsoft` own synthesis implementations
+- `voice-call` consumes the telephony TTS runtime helper
+
+That same pattern should be preferred for future capabilities.
+
 ## Compatible bundles

 OpenClaw also recognizes two compatible external bundle layouts:
@@ -193,6 +263,8 @@ Important trust note:
 - Model Studio provider catalog — bundled as `modelstudio` (enabled by default)
 - Moonshot provider runtime — bundled as `moonshot` (enabled by default)
 - NVIDIA provider catalog — bundled as `nvidia` (enabled by default)
+- ElevenLabs speech provider — bundled as `elevenlabs` (enabled by default)
+- Microsoft speech provider — bundled as `microsoft` (enabled by default; legacy `edge` input maps here)
 - OpenAI provider runtime — bundled as `openai` (enabled by default; owns both `openai` and `openai-codex`)
 - OpenCode Go provider capabilities — bundled as `opencode-go` (enabled by default)
 - OpenCode Zen provider capabilities — bundled as `opencode` (enabled by default)
@@ -218,6 +290,8 @@ Native OpenClaw plugins can register:
 - Gateway HTTP routes
 - Agent tools
 - CLI commands
+- Speech providers
+- Web search providers
 - Background services
 - Context engines
 - Provider auth flows and model catalogs
@@ -229,6 +303,62 @@ Native OpenClaw plugins can register:
 Native OpenClaw plugins run **in‑process** with the Gateway, so
 treat them as trusted code. Tool authoring guide: [Plugin agent tools](/plugins/agent-tools).

+Think of these registrations as **capability claims**. A plugin is not supposed
+to reach into random internals and "just make it work." It should register
+against explicit surfaces that OpenClaw understands, validates, and can expose
+consistently across config, onboarding, status, docs, and runtime behavior.
+
+## Contracts and enforcement
+
+The plugin API surface is intentionally typed and centralized in
+`OpenClawPluginApi`. That contract defines the supported registration points and
+the runtime helpers a plugin may rely on.
+
+Why this matters:
+
+- plugin authors get one stable internal standard
+- core can reject duplicate ownership such as two plugins registering the same
+  provider id
+- startup can surface actionable diagnostics for malformed registration
+- contract tests can enforce bundled-plugin ownership and prevent silent drift
+
+There are two layers of enforcement:
+
+1. **runtime registration enforcement**
+   The plugin registry validates registrations as plugins load. Examples:
+   duplicate provider ids, duplicate speech provider ids, and malformed
+   registrations produce plugin diagnostics instead of undefined behavior.
+2. **contract tests**
+   Bundled plugins are captured in contract registries during test runs so
+   OpenClaw can assert ownership explicitly. Today this is used for model
+   providers, web search providers, and bundled registration ownership.
+
+The practical effect is that OpenClaw knows, up front, which plugin owns which
+surface. That lets core and channels compose seamlessly because ownership is
+declared, typed, and testable rather than implicit.
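
Editorial sketch (not part of the patch): runtime registration enforcement, as described above, might be modelled like this. The registry class, its method names, and the diagnostic shape are assumptions made for illustration; the real OpenClaw plugin registry is not shown in this patch.

```typescript
// Hypothetical sketch: the types and names below are illustrative only.
interface ProviderRegistration {
  pluginId: string;
  providerId: string;
}

interface PluginDiagnostic {
  pluginId: string;
  message: string;
}

class ProviderRegistry {
  // Maps providerId to the pluginId that owns it.
  private readonly owners = new Map<string, string>();
  readonly diagnostics: PluginDiagnostic[] = [];

  // Returns true when the registration is accepted; otherwise records a diagnostic.
  register(reg: ProviderRegistration): boolean {
    if (reg.providerId.trim() === "") {
      this.diagnostics.push({
        pluginId: reg.pluginId,
        message: "provider id must be non-empty",
      });
      return false;
    }
    const owner = this.owners.get(reg.providerId);
    if (owner !== undefined) {
      this.diagnostics.push({
        pluginId: reg.pluginId,
        message: `provider "${reg.providerId}" is already owned by plugin "${owner}"`,
      });
      return false;
    }
    this.owners.set(reg.providerId, reg.pluginId);
    return true;
  }

  ownerOf(providerId: string): string | undefined {
    return this.owners.get(providerId);
  }
}

const registry = new ProviderRegistry();
const first = registry.register({ pluginId: "openai", providerId: "openai" });
const duplicate = registry.register({ pluginId: "rogue", providerId: "openai" });
```

The second registration is rejected with a diagnostic rather than silently replacing the first owner, which is the behavior the section describes as "diagnostics instead of undefined behavior."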
+
+### What belongs in a contract
+
+Good plugin contracts are:
+
+- typed
+- small
+- capability-specific
+- owned by core
+- reusable by multiple plugins
+- consumable by channels/features without vendor knowledge
+
+Bad plugin contracts are:
+
+- vendor-specific policy hidden in core
+- one-off plugin escape hatches that bypass the registry
+- channel code reaching straight into a vendor implementation
+- ad hoc runtime objects that are not part of `OpenClawPluginApi` or
+  `api.runtime`
+
+When in doubt, raise the abstraction level: define the capability first, then
+let plugins plug into it.
+
 ## Provider runtime hooks

 Provider plugins now have two layers:
@@ -530,9 +660,36 @@ const result = await api.runtime.tts.textToSpeechTelephony({

 Notes:

-- Uses core `messages.tts` configuration (OpenAI or ElevenLabs).
+- Uses core `messages.tts` configuration and provider selection.
 - Returns PCM audio buffer + sample rate. Plugins must resample/encode for providers.
-- Edge TTS is not supported for telephony.
+- OpenAI and ElevenLabs support telephony today. Microsoft does not.
+
+Plugins can also register speech providers via `api.registerSpeechProvider(...)`.
+
+```ts
+api.registerSpeechProvider({
+  id: "acme-speech",
+  label: "Acme Speech",
+  isConfigured: ({ config }) => Boolean(config.messages?.tts),
+  synthesize: async (req) => {
+    return {
+      audioBuffer: Buffer.from([]),
+      outputFormat: "mp3",
+      fileExtension: ".mp3",
+      voiceCompatible: false,
+    };
+  },
+});
+```
+
+Notes:
+
+- Keep TTS policy, fallback, and reply delivery in core.
+- Use speech providers for vendor-owned synthesis behavior.
+- Legacy Microsoft `edge` input is normalized to the `microsoft` provider id.
+- The preferred ownership model is company-oriented: one vendor plugin can own
+  text, speech, image, and future media providers as OpenClaw adds those
+  capability contracts.
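
Editorial sketch (not part of the patch): the note about legacy `edge` input being normalized to `microsoft` could be implemented with a small alias table. The helper name `normalizeSpeechProviderId` and the alias map are hypothetical; the patch does not show how OpenClaw performs this normalization, and exact-match lookup is an assumption.

```typescript
// Hypothetical helper: OpenClaw's real normalization code is not shown in this patch.
const LEGACY_SPEECH_PROVIDER_ALIASES = new Map<string, string>([
  ["edge", "microsoft"],
]);

// Returns the canonical provider id, leaving unknown or current ids untouched.
function normalizeSpeechProviderId(input: string): string {
  return LEGACY_SPEECH_PROVIDER_ALIASES.get(input) ?? input;
}

const fromLegacy = normalizeSpeechProviderId("edge");
const unchanged = normalizeSpeechProviderId("openai");
```

Keeping the alias in one table means config parsing and directive parsing can share the same mapping instead of each special-casing `edge`.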

 For STT/transcription, plugins can call:

@@ -1110,12 +1267,49 @@ Plugins export either:
 - `on(...)` for typed lifecycle hooks
 - `registerChannel`
 - `registerProvider`
+- `registerSpeechProvider`
+- `registerWebSearchProvider`
 - `registerHttpRoute`
 - `registerCommand`
 - `registerCli`
 - `registerContextEngine`
 - `registerService`

+In practice, `register(api)` is also where a plugin declares **ownership**.
+That ownership should map cleanly to either:
+
+- a vendor surface such as OpenAI, ElevenLabs, or Microsoft
+- a feature surface such as Voice Call
+
+Avoid splitting one vendor's capabilities across unrelated plugins unless there
+is a strong product reason to do so. The default should be one plugin per
+vendor/feature, with core capability contracts separating shared orchestration
+from vendor-specific behavior.
+
+## Adding a new capability
+
+When a plugin needs behavior that does not fit the current API, do not bypass
+the plugin system with a private reach-in. Add the missing capability.
+
+Recommended sequence:
+
+1. define the core contract
+   Decide what shared behavior core should own: policy, fallback, config merge,
+   lifecycle, channel-facing semantics, and runtime helper shape.
+2. add typed plugin registration/runtime surfaces
+   Extend `OpenClawPluginApi` and/or `api.runtime` with the smallest useful
+   typed seam.
+3. wire core + channel/feature consumers
+   Channels and feature plugins should consume the new capability through core,
+   not by importing a vendor implementation directly.
+4. register vendor implementations
+   Vendor plugins then register their backends against the capability.
+5. add contract coverage
+   Add tests so ownership and registration shape stay explicit over time.
+
+This is how OpenClaw stays opinionated without becoming hardcoded to one
+provider's worldview.
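
Editorial sketch (not part of the patch): step 5, contract coverage, could be approximated by comparing a declared ownership table against what plugins actually registered during a test run. `OwnershipMap` and `findOwnershipDrift` are hypothetical names, and the capture mechanism is assumed rather than taken from OpenClaw's test suite. The expected table uses the speech ownership stated in this document.

```typescript
// Hypothetical sketch of a contract check; not OpenClaw's real test helpers.
// Maps pluginId to the speech provider ids that plugin is expected to own.
type OwnershipMap = Map<string, string[]>;

// Returns a human-readable problem for every mismatch between the two maps.
function findOwnershipDrift(expected: OwnershipMap, captured: OwnershipMap): string[] {
  const problems: string[] = [];
  for (const [pluginId, providerIds] of expected) {
    const actual = captured.get(pluginId) ?? [];
    for (const providerId of providerIds) {
      if (!actual.includes(providerId)) {
        problems.push(`${pluginId} no longer registers ${providerId}`);
      }
    }
  }
  for (const [pluginId, providerIds] of captured) {
    const allowed = expected.get(pluginId) ?? [];
    for (const providerId of providerIds) {
      if (!allowed.includes(providerId)) {
        problems.push(`${pluginId} registers undeclared ${providerId}`);
      }
    }
  }
  return problems;
}

const expectedSpeechOwners: OwnershipMap = new Map([
  ["openai", ["openai"]],
  ["elevenlabs", ["elevenlabs"]],
  ["microsoft", ["microsoft"]],
]);

// Simulated capture in which the Microsoft provider drifted into another plugin.
const capturedSpeechOwners: OwnershipMap = new Map([
  ["openai", ["openai"]],
  ["elevenlabs", ["elevenlabs", "microsoft"]],
]);

const drift = findOwnershipDrift(expectedSpeechOwners, capturedSpeechOwners);
```

A test would fail whenever `drift` is non-empty, so moving a provider between plugins requires updating the declared table in the same change.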
+
 Context engine plugins can also register a runtime-owned context manager:

 ```ts
diff --git a/docs/tts.md b/docs/tts.md
index 682bbfbd53a..4fe0da77e0a 100644
--- a/docs/tts.md
+++ b/docs/tts.md
@@ -9,26 +9,27 @@ title: "Text-to-Speech"

 # Text-to-speech (TTS)

-OpenClaw can convert outbound replies into audio using ElevenLabs, OpenAI, or Edge TTS.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, or OpenAI.
 It works anywhere OpenClaw can send audio; Telegram gets a round voice-note bubble.

 ## Supported services

 - **ElevenLabs** (primary or fallback provider)
+- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`, default when no API keys)
 - **OpenAI** (primary or fallback provider; also used for summaries)
-- **Edge TTS** (primary or fallback provider; uses `node-edge-tts`, default when no API keys)

-### Edge TTS notes
+### Microsoft speech notes

-Edge TTS uses Microsoft Edge's online neural TTS service via the `node-edge-tts`
-library. It's a hosted service (not local), uses Microsoft’s endpoints, and does
-not require an API key. `node-edge-tts` exposes speech configuration options and
-output formats, but not all options are supported by the Edge service.
+The bundled Microsoft speech provider currently uses Microsoft Edge's online
+neural TTS service via the `node-edge-tts` library. It's a hosted service (not
+local), uses Microsoft endpoints, and does not require an API key.
+`node-edge-tts` exposes speech configuration options and output formats, but
+not all options are supported by the service. Legacy config and directive input
+using `edge` still works and is normalized to `microsoft`.

-Because Edge TTS is a public web service without a published SLA or quota, treat it
-as best-effort. If you need guaranteed limits and support, use OpenAI or ElevenLabs.
-Microsoft's Speech REST API documents a 10‑minute audio limit per request; Edge TTS
-does not publish limits, so assume similar or lower limits.
+Because this path is a public web service without a published SLA or quota,
+treat it as best-effort. If you need guaranteed limits and support, use OpenAI
+or ElevenLabs.

 ## Optional keys

@@ -37,8 +38,9 @@ If you want OpenAI or ElevenLabs:
 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
 - `OPENAI_API_KEY`

-Edge TTS does **not** require an API key. If no API keys are found, OpenClaw defaults
-to Edge TTS (unless disabled via `messages.tts.edge.enabled=false`).
+Microsoft speech does **not** require an API key. If no API keys are found,
+OpenClaw defaults to Microsoft (unless disabled via
+`messages.tts.microsoft.enabled=false` or `messages.tts.edge.enabled=false`).

 If multiple providers are configured, the selected provider is used first and the others are fallback options.
 Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
@@ -58,7 +60,7 @@ so that provider must also be authenticated if you enable summaries.
 No. Auto‑TTS is **off** by default. Enable it in config with `messages.tts.auto`
 or per session with `/tts always` (alias: `/tts on`).

-Edge TTS **is** enabled by default once TTS is on, and is used automatically
+Microsoft speech **is** enabled by default once TTS is on, and is used automatically
 when no OpenAI or ElevenLabs API keys are available.

 ## Config
@@ -118,15 +120,15 @@ Full schema is in [Gateway configuration](/gateway/configuration).
} ``` -### Disable Edge TTS +### Disable Microsoft speech ```json5 { messages: { tts: { - edge: { + microsoft: { enabled: false, }, }, @@ -205,9 +207,10 @@ Then run: - `tagged` only sends audio when the reply includes `[[tts]]` tags. - `enabled`: legacy toggle (doctor migrates this to `auto`). - `mode`: `"final"` (default) or `"all"` (includes tool/block replies). -- `provider`: `"elevenlabs"`, `"openai"`, or `"edge"` (fallback is automatic). +- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, or `"openai"` (fallback is automatic). - If `provider` is **unset**, OpenClaw prefers `openai` (if key), then `elevenlabs` (if key), - otherwise `edge`. + otherwise `microsoft`. +- Legacy `provider: "edge"` still works and is normalized to `microsoft`. - `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`. - Accepts `provider/model` or a configured model alias. - `modelOverrides`: allow the model to emit TTS directives (on by default). @@ -227,15 +230,16 @@ Then run: - `elevenlabs.applyTextNormalization`: `auto|on|off` - `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`) - `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism) -- `edge.enabled`: allow Edge TTS usage (default `true`; no API key). -- `edge.voice`: Edge neural voice name (e.g. `en-US-MichelleNeural`). -- `edge.lang`: language code (e.g. `en-US`). -- `edge.outputFormat`: Edge output format (e.g. `audio-24khz-48kbitrate-mono-mp3`). - - See Microsoft Speech output formats for valid values; not all formats are supported by Edge. -- `edge.rate` / `edge.pitch` / `edge.volume`: percent strings (e.g. `+10%`, `-5%`). -- `edge.saveSubtitles`: write JSON subtitles alongside the audio file. -- `edge.proxy`: proxy URL for Edge TTS requests. -- `edge.timeoutMs`: request timeout override (ms). +- `microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key). +- `microsoft.voice`: Microsoft neural voice name (e.g. 
`en-US-MichelleNeural`). +- `microsoft.lang`: language code (e.g. `en-US`). +- `microsoft.outputFormat`: Microsoft output format (e.g. `audio-24khz-48kbitrate-mono-mp3`). + - See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport. +- `microsoft.rate` / `microsoft.pitch` / `microsoft.volume`: percent strings (e.g. `+10%`, `-5%`). +- `microsoft.saveSubtitles`: write JSON subtitles alongside the audio file. +- `microsoft.proxy`: proxy URL for Microsoft speech requests. +- `microsoft.timeoutMs`: request timeout override (ms). +- `edge.*`: legacy alias for the same Microsoft settings. ## Model-driven overrides (default on) @@ -260,7 +264,7 @@ Here you go. Available directive keys (when enabled): -- `provider` (`openai` | `elevenlabs` | `edge`, requires `allowProvider: true`) +- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, or `microsoft`; requires `allowProvider: true`) - `voice` (OpenAI voice) or `voiceId` (ElevenLabs) - `model` (OpenAI TTS model or ElevenLabs model id) - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost` @@ -319,13 +323,12 @@ These override `messages.tts.*` for that host. - 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble. - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI). - 44.1kHz / 128kbps is the default balance for speech clarity. -- **Edge TTS**: uses `edge.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`). - - `node-edge-tts` accepts an `outputFormat`, but not all formats are available - from the Edge service. citeturn2search0 - - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus). citeturn1search0 +- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`). + - The bundled transport accepts an `outputFormat`, but not all formats are available from the service. 
+  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
   - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
     guaranteed Opus voice notes.
-  - If the configured Edge output format fails, OpenClaw retries with MP3.
+  - If the configured Microsoft output format fails, OpenClaw retries with MP3.

 OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for voice-note UX.

diff --git a/extensions/voice-call/README.md b/extensions/voice-call/README.md
index fe228537ee8..36ab127875e 100644
--- a/extensions/voice-call/README.md
+++ b/extensions/voice-call/README.md
@@ -98,7 +98,7 @@ See the plugin docs for recommended ranges and production examples:

 ## TTS for calls

-Voice Call uses the core `messages.tts` configuration (OpenAI or ElevenLabs) for
+Voice Call uses the core `messages.tts` configuration for
 streaming speech on calls. Override examples and provider caveats live here:
 `https://docs.openclaw.ai/plugins/voice-call#tts-for-calls`