docs(plugins): document capability ownership model

Peter Steinberger 2026-03-16 18:50:03 -07:00
parent 662031a88e
commit 6da9ba3267
4 changed files with 238 additions and 41 deletions

View File

@@ -204,7 +204,7 @@ Example with a stable public host:

## TTS for calls

Voice Call uses the core `messages.tts` configuration for
streaming speech on calls. You can override it under the plugin config with the
**same shape** — it deepmerges with `messages.tts`.

@@ -222,7 +222,7 @@ streaming speech on calls. You can override it under the plugin config with the

Notes:

- **Microsoft speech is ignored for voice calls** (telephony audio needs PCM; the current Microsoft transport does not expose telephony PCM output).
- Core TTS is used when Twilio media streaming is enabled; otherwise calls fall back to provider native voices.

### More examples

View File

@@ -97,6 +97,76 @@ The important design boundary:

That split lets OpenClaw validate config, explain missing/disabled plugins, and
build UI/schema hints before the full runtime is active.
## Capability ownership model

OpenClaw treats a native plugin as the ownership boundary for a **company** or a
**feature**, not as a grab bag of unrelated integrations.

That means:

- a company plugin should usually own all of that company's OpenClaw-facing
  surfaces
- a feature plugin should usually own the full feature surface it introduces
- channels should consume shared core capabilities instead of re-implementing
  provider behavior ad hoc

Examples:

- the bundled `openai` plugin owns OpenAI model-provider behavior and OpenAI
  speech behavior
- the bundled `elevenlabs` plugin owns ElevenLabs speech behavior
- the bundled `microsoft` plugin owns Microsoft speech behavior
- the `voice-call` plugin is a feature plugin: it owns call transport, tools,
  CLI, routes, and runtime, but it consumes core TTS/STT capability instead of
  inventing a second speech stack

The intended end state is:

- OpenAI lives in one plugin even if it spans text models, speech, images, and
  future video
- another vendor can do the same for its own surface area
- channels do not care which vendor plugin owns the provider; they consume the
  shared capability contract exposed by core

This is the key distinction:

- **plugin** = ownership boundary
- **capability** = core contract that multiple plugins can implement or consume

So if OpenClaw adds a new domain such as video, the first question is not
"which provider should hardcode video handling?" The first question is "what is
the core video capability contract?" Once that contract exists, vendor plugins
can register against it and channel/feature plugins can consume it.

If the capability does not exist yet, the right move is usually:

1. define the missing capability in core
2. expose it through the plugin API/runtime in a typed way
3. wire channels/features against that capability
4. let vendor plugins register implementations

This keeps ownership explicit while avoiding core behavior that depends on a
single vendor or a one-off plugin-specific code path.

### Capability layering

Use this mental model when deciding where code belongs:

- **core capability layer**: shared orchestration, policy, fallback, config
  merge rules, delivery semantics, and typed contracts
- **vendor plugin layer**: vendor-specific APIs, auth, model catalogs, speech
  synthesis, image generation, future video backends, usage endpoints
- **channel/feature plugin layer**: Slack/Discord/voice-call/etc. integration
  that consumes core capabilities and presents them on a surface

For example, TTS follows this shape:

- core owns reply-time TTS policy, fallback order, prefs, and channel delivery
- `openai`, `elevenlabs`, and `microsoft` own synthesis implementations
- `voice-call` consumes the telephony TTS runtime helper

That same pattern should be preferred for future capabilities.
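To make the layering concrete, here is a minimal TypeScript sketch. All names here (`SpeechSynthesisCapability`, `registerSpeechCapability`, the `acme` provider) are illustrative, not the actual OpenClaw API:

```typescript
// Core capability layer: a typed contract owned by core (hypothetical shape).
interface SpeechSynthesisCapability {
  id: string;
  synthesize(text: string): Promise<{ audio: Uint8Array; format: string }>;
}

// Core owns the registry; vendors never see each other.
const speechProviders = new Map<string, SpeechSynthesisCapability>();

function registerSpeechCapability(impl: SpeechSynthesisCapability): void {
  speechProviders.set(impl.id, impl);
}

// Vendor plugin layer: a vendor registers against the core contract.
registerSpeechCapability({
  id: "acme",
  synthesize: async (text) => ({
    audio: new TextEncoder().encode(text), // stand-in for real synthesis
    format: "mp3",
  }),
});

// Channel/feature plugin layer: consumers resolve by capability id,
// with no knowledge of which vendor plugin owns the implementation.
async function speakReply(text: string, preferred: string[]): Promise<string> {
  for (const id of preferred) {
    const provider = speechProviders.get(id);
    if (provider) return (await provider.synthesize(text)).format;
  }
  throw new Error("no speech provider available");
}
```

The point of the sketch is that `speakReply` (a channel concern) only ever touches the core contract, so swapping or adding vendors never changes channel code.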
## Compatible bundles

OpenClaw also recognizes two compatible external bundle layouts:

@@ -193,6 +263,8 @@ Important trust note:

- Model Studio provider catalog — bundled as `modelstudio` (enabled by default)
- Moonshot provider runtime — bundled as `moonshot` (enabled by default)
- NVIDIA provider catalog — bundled as `nvidia` (enabled by default)
- ElevenLabs speech provider — bundled as `elevenlabs` (enabled by default)
- Microsoft speech provider — bundled as `microsoft` (enabled by default; legacy `edge` input maps here)
- OpenAI provider runtime — bundled as `openai` (enabled by default; owns both `openai` and `openai-codex`)
- OpenCode Go provider capabilities — bundled as `opencode-go` (enabled by default)
- OpenCode Zen provider capabilities — bundled as `opencode` (enabled by default)

@@ -218,6 +290,8 @@ Native OpenClaw plugins can register:

- Gateway HTTP routes
- Agent tools
- CLI commands
- Speech providers
- Web search providers
- Background services
- Context engines
- Provider auth flows and model catalogs

@@ -229,6 +303,62 @@ Native OpenClaw plugins can register:

Native OpenClaw plugins run **in-process** with the Gateway, so treat them as trusted code.

Tool authoring guide: [Plugin agent tools](/plugins/agent-tools).
Think of these registrations as **capability claims**. A plugin is not supposed
to reach into random internals and "just make it work." It should register
against explicit surfaces that OpenClaw understands, validates, and can expose
consistently across config, onboarding, status, docs, and runtime behavior.

## Contracts and enforcement

The plugin API surface is intentionally typed and centralized in
`OpenClawPluginApi`. That contract defines the supported registration points and
the runtime helpers a plugin may rely on.

Why this matters:

- plugin authors get one stable internal standard
- core can reject duplicate ownership such as two plugins registering the same
  provider id
- startup can surface actionable diagnostics for malformed registration
- contract tests can enforce bundled-plugin ownership and prevent silent drift

There are two layers of enforcement:

1. **runtime registration enforcement**
   The plugin registry validates registrations as plugins load. Examples:
   duplicate provider ids, duplicate speech provider ids, and malformed
   registrations produce plugin diagnostics instead of undefined behavior.
2. **contract tests**
   Bundled plugins are captured in contract registries during test runs so
   OpenClaw can assert ownership explicitly. Today this is used for model
   providers, web search providers, and bundled registration ownership.

The practical effect is that OpenClaw knows, up front, which plugin owns which
surface. That lets core and channels compose seamlessly because ownership is
declared, typed, and testable rather than implicit.
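A minimal sketch of what duplicate-ownership rejection can look like. The names (`PluginRegistry`, `Diagnostic`) are hypothetical; the real registry lives in core:

```typescript
// Hypothetical sketch: duplicate registrations become diagnostics,
// not undefined behavior.
type Diagnostic = { plugin: string; message: string };

class PluginRegistry {
  private providerOwners = new Map<string, string>();
  readonly diagnostics: Diagnostic[] = [];

  // Returns true if the claim succeeded, false if it produced a diagnostic.
  registerProvider(plugin: string, providerId: string): boolean {
    const owner = this.providerOwners.get(providerId);
    if (owner) {
      this.diagnostics.push({
        plugin,
        message: `provider "${providerId}" is already owned by plugin "${owner}"`,
      });
      return false;
    }
    this.providerOwners.set(providerId, plugin);
    return true;
  }
}
```

The first plugin to claim an id wins; later claims surface as actionable diagnostics instead of silently overwriting ownership.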
### What belongs in a contract

Good plugin contracts are:

- typed
- small
- capability-specific
- owned by core
- reusable by multiple plugins
- consumable by channels/features without vendor knowledge

Bad plugin contracts are:

- vendor-specific policy hidden in core
- one-off plugin escape hatches that bypass the registry
- channel code reaching straight into a vendor implementation
- ad hoc runtime objects that are not part of `OpenClawPluginApi` or
  `api.runtime`

When in doubt, raise the abstraction level: define the capability first, then
let plugins plug into it.
## Provider runtime hooks

Provider plugins now have two layers:

@@ -530,9 +660,36 @@ const result = await api.runtime.tts.textToSpeechTelephony({

Notes:

- Uses core `messages.tts` configuration and provider selection.
- Returns PCM audio buffer + sample rate. Plugins must resample/encode for providers.
- OpenAI and ElevenLabs support telephony today. Microsoft does not.
Plugins can also register speech providers via `api.registerSpeechProvider(...)`.
```ts
api.registerSpeechProvider({
  // Unique speech provider id; the registry rejects duplicate ids at load.
  id: "acme-speech",
  label: "Acme Speech",
  // Report whether this provider is usable with the current config.
  isConfigured: ({ config }) => Boolean(config.messages?.tts),
  // Return synthesized audio plus the metadata core needs to deliver it.
  synthesize: async (req) => {
    return {
      audioBuffer: Buffer.from([]),
      outputFormat: "mp3",
      fileExtension: ".mp3",
      voiceCompatible: false,
    };
  },
});
```
Notes:
- Keep TTS policy, fallback, and reply delivery in core.
- Use speech providers for vendor-owned synthesis behavior.
- Legacy Microsoft `edge` input is normalized to the `microsoft` provider id.
- The preferred ownership model is company-oriented: one vendor plugin can own
text, speech, image, and future media providers as OpenClaw adds those
capability contracts.
For STT/transcription, plugins can call:

@@ -1110,12 +1267,49 @@ Plugins export either:

- `on(...)` for typed lifecycle hooks
- `registerChannel`
- `registerProvider`
- `registerSpeechProvider`
- `registerWebSearchProvider`
- `registerHttpRoute`
- `registerCommand`
- `registerCli`
- `registerContextEngine`
- `registerService`
In practice, `register(api)` is also where a plugin declares **ownership**.
That ownership should map cleanly to either:

- a vendor surface such as OpenAI, ElevenLabs, or Microsoft
- a feature surface such as Voice Call

Avoid splitting one vendor's capabilities across unrelated plugins unless there
is a strong product reason to do so. The default should be one plugin per
vendor/feature, with core capability contracts separating shared orchestration
from vendor-specific behavior.
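As a sketch, a vendor plugin's `register(api)` can claim the vendor's whole surface in one place. The `acme` ids and the trimmed-down `PluginApi` shape here are illustrative, not the real `OpenClawPluginApi`:

```typescript
// Hypothetical, trimmed-down registration API for illustration only.
interface PluginApi {
  registerProvider(p: { id: string }): void;
  registerSpeechProvider(p: { id: string; label: string }): void;
}

// One vendor plugin owns the model-provider surface AND the speech surface,
// instead of scattering them across unrelated plugins.
function register(api: PluginApi): void {
  api.registerProvider({ id: "acme" });
  api.registerSpeechProvider({ id: "acme-speech", label: "Acme Speech" });
}
```

If the vendor later gains image or video backends, they register here too, against whatever capability contracts core exposes by then.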
## Adding a new capability

When a plugin needs behavior that does not fit the current API, do not bypass
the plugin system with a private reach-in. Add the missing capability.

Recommended sequence:

1. define the core contract
   Decide what shared behavior core should own: policy, fallback, config merge,
   lifecycle, channel-facing semantics, and runtime helper shape.
2. add typed plugin registration/runtime surfaces
   Extend `OpenClawPluginApi` and/or `api.runtime` with the smallest useful
   typed seam.
3. wire core + channel/feature consumers
   Channels and feature plugins should consume the new capability through core,
   not by importing a vendor implementation directly.
4. register vendor implementations
   Vendor plugins then register their backends against the capability.
5. add contract coverage
   Add tests so ownership and registration shape stay explicit over time.

This is how OpenClaw stays opinionated without becoming hardcoded to one
provider's worldview.
Context engine plugins can also register a runtime-owned context manager:

```ts

View File

@@ -9,26 +9,27 @@ title: "Text-to-Speech"

# Text-to-speech (TTS)

OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, or OpenAI.
It works anywhere OpenClaw can send audio; Telegram gets a round voice-note bubble.

## Supported services

- **ElevenLabs** (primary or fallback provider)
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`, default when no API keys)
- **OpenAI** (primary or fallback provider; also used for summaries)

### Microsoft speech notes

The bundled Microsoft speech provider currently uses Microsoft Edge's online
neural TTS service via the `node-edge-tts` library. It's a hosted service (not
local), uses Microsoft endpoints, and does not require an API key.
`node-edge-tts` exposes speech configuration options and output formats, but
not all options are supported by the service. Legacy config and directive input
using `edge` still works and is normalized to `microsoft`.

Because this path is a public web service without a published SLA or quota,
treat it as best-effort. If you need guaranteed limits and support, use OpenAI
or ElevenLabs.

## Optional keys

@@ -37,8 +38,9 @@ If you want OpenAI or ElevenLabs:

- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `OPENAI_API_KEY`

Microsoft speech does **not** require an API key. If no API keys are found,
OpenClaw defaults to Microsoft (unless disabled via
`messages.tts.microsoft.enabled=false` or `messages.tts.edge.enabled=false`).

If multiple providers are configured, the selected provider is used first and the others are fallback options.

Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),

@@ -58,7 +60,7 @@ so that provider must also be authenticated if you enable summaries.

No. AutoTTS is **off** by default. Enable it in config with
`messages.tts.auto` or per session with `/tts always` (alias: `/tts on`).

Microsoft speech **is** enabled by default once TTS is on, and is used automatically
when no OpenAI or ElevenLabs API keys are available.
## Config

@@ -118,15 +120,15 @@ Full schema is in [Gateway configuration](/gateway/configuration).

}
```

### Microsoft primary (no API key)

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      microsoft: {
        enabled: true,
        voice: "en-US-MichelleNeural",
        lang: "en-US",

@@ -139,13 +141,13 @@ Full schema is in [Gateway configuration](/gateway/configuration).

}
```

### Disable Microsoft speech

```json5
{
  messages: {
    tts: {
      microsoft: {
        enabled: false,
      },
    },
@@ -205,9 +207,10 @@ Then run:

- `tagged` only sends audio when the reply includes `[[tts]]` tags.
- `enabled`: legacy toggle (doctor migrates this to `auto`).
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, or `"openai"` (fallback is automatic).
- If `provider` is **unset**, OpenClaw prefers `openai` (if key), then `elevenlabs` (if key),
  otherwise `microsoft`.
- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
  - Accepts `provider/model` or a configured model alias.
- `modelOverrides`: allow the model to emit TTS directives (on by default).
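The default selection order described above can be sketched as a small helper (hypothetical function, not the actual OpenClaw code):

```typescript
// Hypothetical sketch of default TTS provider selection:
// explicit config wins; otherwise openai (if key), elevenlabs (if key),
// then the keyless microsoft default.
type SpeechProviderId = "openai" | "elevenlabs" | "microsoft";

function resolveTtsProvider(opts: {
  configured?: SpeechProviderId;
  hasOpenAiKey: boolean;
  hasElevenLabsKey: boolean;
}): SpeechProviderId {
  if (opts.configured) return opts.configured;
  if (opts.hasOpenAiKey) return "openai";
  if (opts.hasElevenLabsKey) return "elevenlabs";
  return "microsoft";
}
```

Providers that are not selected remain available as automatic fallbacks.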
@@ -227,15 +230,16 @@ Then run:

- `elevenlabs.applyTextNormalization`: `auto|on|off`
- `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
- `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
- `microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
- `microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
- `microsoft.lang`: language code (e.g. `en-US`).
- `microsoft.outputFormat`: Microsoft output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
  - See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport.
- `microsoft.rate` / `microsoft.pitch` / `microsoft.volume`: percent strings (e.g. `+10%`, `-5%`).
- `microsoft.saveSubtitles`: write JSON subtitles alongside the audio file.
- `microsoft.proxy`: proxy URL for Microsoft speech requests.
- `microsoft.timeoutMs`: request timeout override (ms).
- `edge.*`: legacy alias for the same Microsoft settings.
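A rough sketch of how the legacy `edge` alias can fold into `microsoft` settings (illustrative helpers, not the actual normalization code):

```typescript
// Hypothetical sketch: legacy `edge` ids and config keys map onto `microsoft`.
function normalizeSpeechProviderId(id: string): string {
  return id === "edge" ? "microsoft" : id;
}

function normalizeTtsConfig(tts: Record<string, unknown>): Record<string, unknown> {
  const { edge, microsoft, ...rest } = tts as {
    edge?: Record<string, unknown>;
    microsoft?: Record<string, unknown>;
    [key: string]: unknown;
  };
  if (!edge) return tts;
  // Explicit `microsoft` settings win; legacy `edge` values fill the gaps.
  return { ...rest, microsoft: { ...edge, ...microsoft } };
}
```

This mirrors the documented behavior: legacy input keeps working, but `microsoft` is the canonical key going forward.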
## Model-driven overrides (default on)

@@ -260,7 +264,7 @@ Here you go.

Available directive keys (when enabled):

- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, or `microsoft`; requires `allowProvider: true`)
- `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
- `model` (OpenAI TTS model or ElevenLabs model id)
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
@@ -319,13 +323,12 @@ These override `messages.tts.*` for that host.

- 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
  - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
  - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
    guaranteed Opus voice notes.
  - If the configured Microsoft output format fails, OpenClaw retries with MP3.

OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for voice-note UX.
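The MP3 retry behavior for Microsoft output formats can be sketched as a wrapper (hypothetical helper shape, not the actual implementation):

```typescript
// Hypothetical sketch: try the configured Microsoft output format once,
// then fall back to the MP3 default if the service rejects it.
async function synthesizeWithFormatFallback(
  synthesize: (outputFormat: string) => Promise<Uint8Array>,
  configuredFormat: string,
): Promise<{ audio: Uint8Array; outputFormat: string }> {
  try {
    return { audio: await synthesize(configuredFormat), outputFormat: configuredFormat };
  } catch {
    const mp3 = "audio-24khz-48kbitrate-mono-mp3";
    return { audio: await synthesize(mp3), outputFormat: mp3 };
  }
}
```

The retry keeps TTS best-effort: a misconfigured or unsupported format degrades to MP3 instead of dropping the reply audio.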

View File

@@ -98,7 +98,7 @@ See the plugin docs for recommended ranges and production examples:

## TTS for calls

Voice Call uses the core `messages.tts` configuration for
streaming speech on calls. Override examples and provider caveats live here:
`https://docs.openclaw.ai/plugins/voice-call#tts-for-calls` `https://docs.openclaw.ai/plugins/voice-call#tts-for-calls`