docs(plugins): document capability ownership model
Parent: 662031a88e
Commit: 6da9ba3267
@@ -204,7 +204,7 @@ Example with a stable public host:
 
 ## TTS for calls
 
-Voice Call uses the core `messages.tts` configuration (OpenAI or ElevenLabs) for
+Voice Call uses the core `messages.tts` configuration for
 streaming speech on calls. You can override it under the plugin config with the
 **same shape** — it deep‑merges with `messages.tts`.
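The override semantics above amount to a recursive object merge. As an illustrative sketch only (a hypothetical `deepMerge` helper, not OpenClaw's actual implementation):

```typescript
// Illustrative sketch: how a plugin-level TTS override could deep-merge over
// the core `messages.tts` config. Names and shapes are assumptions.
type Cfg = { [key: string]: unknown };

function deepMerge(base: Cfg, override: Cfg): Cfg {
  const out: Cfg = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const prev = out[key];
    if (
      value && typeof value === "object" && !Array.isArray(value) &&
      prev && typeof prev === "object" && !Array.isArray(prev)
    ) {
      // Nested objects merge key by key.
      out[key] = deepMerge(prev as Cfg, value as Cfg);
    } else {
      // Scalars and arrays replace wholesale.
      out[key] = value;
    }
  }
  return out;
}

const coreTts = { provider: "openai", openai: { voice: "alloy", model: "tts-1" } };
const callOverride = { openai: { voice: "verse" } };
// Result keeps provider and model, but the call override wins on voice.
console.log(deepMerge(coreTts, callOverride));
```

Plugin keys win per field while untouched core keys survive, which is why the override can use the same shape as `messages.tts`.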
@@ -222,7 +222,7 @@ streaming speech on calls. You can override it under the plugin config with the
 
 Notes:
 
-- **Edge TTS is ignored for voice calls** (telephony audio needs PCM; Edge output is unreliable).
+- **Microsoft speech is ignored for voice calls** (telephony audio needs PCM; the current Microsoft transport does not expose telephony PCM output).
 - Core TTS is used when Twilio media streaming is enabled; otherwise calls fall back to provider native voices.
 
 ### More examples
@@ -97,6 +97,76 @@ The important design boundary:
 That split lets OpenClaw validate config, explain missing/disabled plugins, and
 build UI/schema hints before the full runtime is active.
+
+## Capability ownership model
+
+OpenClaw treats a native plugin as the ownership boundary for a **company** or a
+**feature**, not as a grab bag of unrelated integrations.
+
+That means:
+
+- a company plugin should usually own all of that company's OpenClaw-facing
+  surfaces
+- a feature plugin should usually own the full feature surface it introduces
+- channels should consume shared core capabilities instead of re-implementing
+  provider behavior ad hoc
+
+Examples:
+
+- the bundled `openai` plugin owns OpenAI model-provider behavior and OpenAI
+  speech behavior
+- the bundled `elevenlabs` plugin owns ElevenLabs speech behavior
+- the bundled `microsoft` plugin owns Microsoft speech behavior
+- the `voice-call` plugin is a feature plugin: it owns call transport, tools,
+  CLI, routes, and runtime, but it consumes core TTS/STT capability instead of
+  inventing a second speech stack
+
+The intended end state is:
+
+- OpenAI lives in one plugin even if it spans text models, speech, images, and
+  future video
+- another vendor can do the same for its own surface area
+- channels do not care which vendor plugin owns the provider; they consume the
+  shared capability contract exposed by core
+
+This is the key distinction:
+
+- **plugin** = ownership boundary
+- **capability** = core contract that multiple plugins can implement or consume
+
+So if OpenClaw adds a new domain such as video, the first question is not
+"which provider should hardcode video handling?" The first question is "what is
+the core video capability contract?" Once that contract exists, vendor plugins
+can register against it and channel/feature plugins can consume it.
+
+If the capability does not exist yet, the right move is usually:
+
+1. define the missing capability in core
+2. expose it through the plugin API/runtime in a typed way
+3. wire channels/features against that capability
+4. let vendor plugins register implementations
+
+This keeps ownership explicit while avoiding core behavior that depends on a
+single vendor or a one-off plugin-specific code path.
+
+### Capability layering
+
+Use this mental model when deciding where code belongs:
+
+- **core capability layer**: shared orchestration, policy, fallback, config
+  merge rules, delivery semantics, and typed contracts
+- **vendor plugin layer**: vendor-specific APIs, auth, model catalogs, speech
+  synthesis, image generation, future video backends, usage endpoints
+- **channel/feature plugin layer**: Slack/Discord/voice-call/etc. integration
+  that consumes core capabilities and presents them on a surface
+
+For example, TTS follows this shape:
+
+- core owns reply-time TTS policy, fallback order, prefs, and channel delivery
+- `openai`, `elevenlabs`, and `microsoft` own synthesis implementations
+- `voice-call` consumes the telephony TTS runtime helper
+
+That same pattern should be preferred for future capabilities.
 
 ## Compatible bundles
 
 OpenClaw also recognizes two compatible external bundle layouts:
@@ -193,6 +263,8 @@ Important trust note:
 - Model Studio provider catalog — bundled as `modelstudio` (enabled by default)
 - Moonshot provider runtime — bundled as `moonshot` (enabled by default)
 - NVIDIA provider catalog — bundled as `nvidia` (enabled by default)
+- ElevenLabs speech provider — bundled as `elevenlabs` (enabled by default)
+- Microsoft speech provider — bundled as `microsoft` (enabled by default; legacy `edge` input maps here)
 - OpenAI provider runtime — bundled as `openai` (enabled by default; owns both `openai` and `openai-codex`)
 - OpenCode Go provider capabilities — bundled as `opencode-go` (enabled by default)
 - OpenCode Zen provider capabilities — bundled as `opencode` (enabled by default)
@@ -218,6 +290,8 @@ Native OpenClaw plugins can register:
 - Gateway HTTP routes
 - Agent tools
 - CLI commands
+- Speech providers
+- Web search providers
 - Background services
 - Context engines
 - Provider auth flows and model catalogs
@@ -229,6 +303,62 @@ Native OpenClaw plugins can register:
 Native OpenClaw plugins run **in‑process** with the Gateway, so treat them as trusted code.
 Tool authoring guide: [Plugin agent tools](/plugins/agent-tools).
+
+Think of these registrations as **capability claims**. A plugin is not supposed
+to reach into random internals and "just make it work." It should register
+against explicit surfaces that OpenClaw understands, validates, and can expose
+consistently across config, onboarding, status, docs, and runtime behavior.
+
+## Contracts and enforcement
+
+The plugin API surface is intentionally typed and centralized in
+`OpenClawPluginApi`. That contract defines the supported registration points and
+the runtime helpers a plugin may rely on.
+
+Why this matters:
+
+- plugin authors get one stable internal standard
+- core can reject duplicate ownership such as two plugins registering the same
+  provider id
+- startup can surface actionable diagnostics for malformed registration
+- contract tests can enforce bundled-plugin ownership and prevent silent drift
+
+There are two layers of enforcement:
+
+1. **runtime registration enforcement**
+   The plugin registry validates registrations as plugins load. Examples:
+   duplicate provider ids, duplicate speech provider ids, and malformed
+   registrations produce plugin diagnostics instead of undefined behavior.
+2. **contract tests**
+   Bundled plugins are captured in contract registries during test runs so
+   OpenClaw can assert ownership explicitly. Today this is used for model
+   providers, web search providers, and bundled registration ownership.
+
+The practical effect is that OpenClaw knows, up front, which plugin owns which
+surface. That lets core and channels compose seamlessly because ownership is
+declared, typed, and testable rather than implicit.
+
+### What belongs in a contract
+
+Good plugin contracts are:
+
+- typed
+- small
+- capability-specific
+- owned by core
+- reusable by multiple plugins
+- consumable by channels/features without vendor knowledge
+
+Bad plugin contracts are:
+
+- vendor-specific policy hidden in core
+- one-off plugin escape hatches that bypass the registry
+- channel code reaching straight into a vendor implementation
+- ad hoc runtime objects that are not part of `OpenClawPluginApi` or
+  `api.runtime`
+
+When in doubt, raise the abstraction level: define the capability first, then
+let plugins plug into it.
 
 ## Provider runtime hooks
 
 Provider plugins now have two layers:
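To make the enforcement idea above concrete, here is a hypothetical sketch of a registry that rejects duplicate ownership at load time. `SpeechProvider` and `CapabilityRegistry` are illustrative names for this sketch, not the actual `OpenClawPluginApi` types:

```typescript
// Hypothetical sketch of a capability registry enforcing single ownership.
interface SpeechProvider {
  id: string; // unique provider id, e.g. "elevenlabs"
  synthesize(text: string): Promise<Uint8Array>;
}

class CapabilityRegistry {
  private providers = new Map<string, SpeechProvider>();
  readonly diagnostics: string[] = [];

  register(pluginId: string, provider: SpeechProvider): void {
    if (this.providers.has(provider.id)) {
      // Duplicate ownership becomes a diagnostic, not undefined behavior.
      this.diagnostics.push(
        `plugin ${pluginId}: speech provider id "${provider.id}" already registered`,
      );
      return;
    }
    this.providers.set(provider.id, provider);
  }

  get(id: string): SpeechProvider | undefined {
    return this.providers.get(id);
  }
}
```

Core owns the registry and the contract; vendor plugins only supply implementations keyed by id, which is what makes ownership testable.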
@@ -530,9 +660,36 @@ const result = await api.runtime.tts.textToSpeechTelephony({
 
 Notes:
 
-- Uses core `messages.tts` configuration (OpenAI or ElevenLabs).
+- Uses core `messages.tts` configuration and provider selection.
 - Returns PCM audio buffer + sample rate. Plugins must resample/encode for providers.
-- Edge TTS is not supported for telephony.
+- OpenAI and ElevenLabs support telephony today. Microsoft does not.
+
+Plugins can also register speech providers via `api.registerSpeechProvider(...)`.
+
+```ts
+api.registerSpeechProvider({
+  id: "acme-speech",
+  label: "Acme Speech",
+  isConfigured: ({ config }) => Boolean(config.messages?.tts),
+  synthesize: async (req) => {
+    return {
+      audioBuffer: Buffer.from([]),
+      outputFormat: "mp3",
+      fileExtension: ".mp3",
+      voiceCompatible: false,
+    };
+  },
+});
+```
+
+Notes:
+
+- Keep TTS policy, fallback, and reply delivery in core.
+- Use speech providers for vendor-owned synthesis behavior.
+- Legacy Microsoft `edge` input is normalized to the `microsoft` provider id.
+- The preferred ownership model is company-oriented: one vendor plugin can own
+  text, speech, image, and future media providers as OpenClaw adds those
+  capability contracts.
 
 For STT/transcription, plugins can call:
@@ -1110,12 +1267,49 @@ Plugins export either:
 
 - `on(...)` for typed lifecycle hooks
 - `registerChannel`
 - `registerProvider`
+- `registerSpeechProvider`
+- `registerWebSearchProvider`
 - `registerHttpRoute`
 - `registerCommand`
 - `registerCli`
 - `registerContextEngine`
 - `registerService`
+
+In practice, `register(api)` is also where a plugin declares **ownership**.
+That ownership should map cleanly to either:
+
+- a vendor surface such as OpenAI, ElevenLabs, or Microsoft
+- a feature surface such as Voice Call
+
+Avoid splitting one vendor's capabilities across unrelated plugins unless there
+is a strong product reason to do so. The default should be one plugin per
+vendor/feature, with core capability contracts separating shared orchestration
+from vendor-specific behavior.
+
+## Adding a new capability
+
+When a plugin needs behavior that does not fit the current API, do not bypass
+the plugin system with a private reach-in. Add the missing capability.
+
+Recommended sequence:
+
+1. define the core contract
+   Decide what shared behavior core should own: policy, fallback, config merge,
+   lifecycle, channel-facing semantics, and runtime helper shape.
+2. add typed plugin registration/runtime surfaces
+   Extend `OpenClawPluginApi` and/or `api.runtime` with the smallest useful
+   typed seam.
+3. wire core + channel/feature consumers
+   Channels and feature plugins should consume the new capability through core,
+   not by importing a vendor implementation directly.
+4. register vendor implementations
+   Vendor plugins then register their backends against the capability.
+5. add contract coverage
+   Add tests so ownership and registration shape stay explicit over time.
+
+This is how OpenClaw stays opinionated without becoming hardcoded to one
+provider's worldview.
 
 Context engine plugins can also register a runtime-owned context manager:
 
 ```ts
docs/tts.md (75 lines changed)
@@ -9,26 +9,27 @@ title: "Text-to-Speech"
 
 # Text-to-speech (TTS)
 
-OpenClaw can convert outbound replies into audio using ElevenLabs, OpenAI, or Edge TTS.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, or OpenAI.
 It works anywhere OpenClaw can send audio; Telegram gets a round voice-note bubble.
 
 ## Supported services
 
 - **ElevenLabs** (primary or fallback provider)
+- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`, default when no API keys)
 - **OpenAI** (primary or fallback provider; also used for summaries)
-- **Edge TTS** (primary or fallback provider; uses `node-edge-tts`, default when no API keys)
 
-### Edge TTS notes
+### Microsoft speech notes
 
-Edge TTS uses Microsoft Edge's online neural TTS service via the `node-edge-tts`
-library. It's a hosted service (not local), uses Microsoft’s endpoints, and does
-not require an API key. `node-edge-tts` exposes speech configuration options and
-output formats, but not all options are supported by the Edge service.
+The bundled Microsoft speech provider currently uses Microsoft Edge's online
+neural TTS service via the `node-edge-tts` library. It's a hosted service (not
+local), uses Microsoft endpoints, and does not require an API key.
+`node-edge-tts` exposes speech configuration options and output formats, but
+not all options are supported by the service. Legacy config and directive input
+using `edge` still works and is normalized to `microsoft`.
 
-Because Edge TTS is a public web service without a published SLA or quota, treat it
-as best-effort. If you need guaranteed limits and support, use OpenAI or ElevenLabs.
-Microsoft's Speech REST API documents a 10‑minute audio limit per request; Edge TTS
-does not publish limits, so assume similar or lower limits.
+Because this path is a public web service without a published SLA or quota,
+treat it as best-effort. If you need guaranteed limits and support, use OpenAI
+or ElevenLabs.
 
 ## Optional keys
@@ -37,8 +38,9 @@ If you want OpenAI or ElevenLabs:
 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
 - `OPENAI_API_KEY`
 
-Edge TTS does **not** require an API key. If no API keys are found, OpenClaw defaults
-to Edge TTS (unless disabled via `messages.tts.edge.enabled=false`).
+Microsoft speech does **not** require an API key. If no API keys are found,
+OpenClaw defaults to Microsoft (unless disabled via
+`messages.tts.microsoft.enabled=false` or `messages.tts.edge.enabled=false`).
 
 If multiple providers are configured, the selected provider is used first and the others are fallback options.
 Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
@@ -58,7 +60,7 @@ so that provider must also be authenticated if you enable summaries.
 No. Auto‑TTS is **off** by default. Enable it in config with
 `messages.tts.auto` or per session with `/tts always` (alias: `/tts on`).
 
-Edge TTS **is** enabled by default once TTS is on, and is used automatically
+Microsoft speech **is** enabled by default once TTS is on, and is used automatically
 when no OpenAI or ElevenLabs API keys are available.
 
 ## Config
@@ -118,15 +120,15 @@ Full schema is in [Gateway configuration](/gateway/configuration).
 }
 ```
 
-### Edge TTS primary (no API key)
+### Microsoft primary (no API key)
 
 ```json5
 {
   messages: {
     tts: {
       auto: "always",
-      provider: "edge",
-      edge: {
+      provider: "microsoft",
+      microsoft: {
         enabled: true,
         voice: "en-US-MichelleNeural",
         lang: "en-US",
@@ -139,13 +141,13 @@ Full schema is in [Gateway configuration](/gateway/configuration).
 }
 ```
 
-### Disable Edge TTS
+### Disable Microsoft speech
 
 ```json5
 {
   messages: {
     tts: {
-      edge: {
+      microsoft: {
         enabled: false,
       },
     },
@@ -205,9 +207,10 @@ Then run:
 - `tagged` only sends audio when the reply includes `[[tts]]` tags.
 - `enabled`: legacy toggle (doctor migrates this to `auto`).
 - `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: `"elevenlabs"`, `"openai"`, or `"edge"` (fallback is automatic).
+- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, or `"openai"` (fallback is automatic).
 - If `provider` is **unset**, OpenClaw prefers `openai` (if key), then `elevenlabs` (if key),
-  otherwise `edge`.
+  otherwise `microsoft`.
+- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
 - `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
   - Accepts `provider/model` or a configured model alias.
 - `modelOverrides`: allow the model to emit TTS directives (on by default).
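The unset-provider selection order documented above can be sketched as follows. `pickTtsProvider` is a hypothetical helper for illustration, not OpenClaw's actual code:

```typescript
// Illustrative sketch of the documented selection order when `provider` is
// unset: prefer openai (if key), then elevenlabs (if key), else microsoft.
function pickTtsProvider(opts: {
  configured?: string; // explicit messages.tts.provider, if set
  hasOpenAiKey: boolean;
  hasElevenLabsKey: boolean;
}): string {
  if (opts.configured) {
    // Legacy "edge" input is normalized to the "microsoft" provider id.
    return opts.configured === "edge" ? "microsoft" : opts.configured;
  }
  if (opts.hasOpenAiKey) return "openai";
  if (opts.hasElevenLabsKey) return "elevenlabs";
  return "microsoft"; // no API key required
}
```

An explicitly configured provider always wins; the preference order only applies when `provider` is unset.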
@@ -227,15 +230,16 @@ Then run:
 - `elevenlabs.applyTextNormalization`: `auto|on|off`
 - `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
 - `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
-- `edge.enabled`: allow Edge TTS usage (default `true`; no API key).
-- `edge.voice`: Edge neural voice name (e.g. `en-US-MichelleNeural`).
-- `edge.lang`: language code (e.g. `en-US`).
-- `edge.outputFormat`: Edge output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
-  - See Microsoft Speech output formats for valid values; not all formats are supported by Edge.
-- `edge.rate` / `edge.pitch` / `edge.volume`: percent strings (e.g. `+10%`, `-5%`).
-- `edge.saveSubtitles`: write JSON subtitles alongside the audio file.
-- `edge.proxy`: proxy URL for Edge TTS requests.
-- `edge.timeoutMs`: request timeout override (ms).
+- `microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
+- `microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
+- `microsoft.lang`: language code (e.g. `en-US`).
+- `microsoft.outputFormat`: Microsoft output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
+  - See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport.
+- `microsoft.rate` / `microsoft.pitch` / `microsoft.volume`: percent strings (e.g. `+10%`, `-5%`).
+- `microsoft.saveSubtitles`: write JSON subtitles alongside the audio file.
+- `microsoft.proxy`: proxy URL for Microsoft speech requests.
+- `microsoft.timeoutMs`: request timeout override (ms).
+- `edge.*`: legacy alias for the same Microsoft settings.
 
 ## Model-driven overrides (default on)
@@ -260,7 +264,7 @@ Here you go.
 
 Available directive keys (when enabled):
 
-- `provider` (`openai` | `elevenlabs` | `edge`, requires `allowProvider: true`)
+- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, or `microsoft`; requires `allowProvider: true`)
 - `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
 - `model` (OpenAI TTS model or ElevenLabs model id)
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
@@ -319,13 +323,12 @@ These override `messages.tts.*` for that host.
   - 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble.
 - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
   - 44.1kHz / 128kbps is the default balance for speech clarity.
-- **Edge TTS**: uses `edge.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
-  - `node-edge-tts` accepts an `outputFormat`, but not all formats are available
-    from the Edge service.
-  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
+- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
+  - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
+  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
 - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
   guaranteed Opus voice notes.
-- If the configured Edge output format fails, OpenClaw retries with MP3.
+- If the configured Microsoft output format fails, OpenClaw retries with MP3.
 
 OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for voice-note UX.
@@ -98,7 +98,7 @@ See the plugin docs for recommended ranges and production examples:
 
 ## TTS for calls
 
-Voice Call uses the core `messages.tts` configuration (OpenAI or ElevenLabs) for
+Voice Call uses the core `messages.tts` configuration for
 streaming speech on calls. Override examples and provider caveats live here:
 `https://docs.openclaw.ai/plugins/voice-call#tts-for-calls`