docs(plugins): document media capability ownership

This commit is contained in:
Peter Steinberger 2026-03-16 20:42:08 -07:00
parent 3e010e280a
commit 3566e88c08
No known key found for this signature in database
2 changed files with 52 additions and 14 deletions

View File

@ -10,6 +10,10 @@ title: "Media Understanding"
OpenClaw can **summarize inbound media** (image/audio/video) before the reply pipeline runs. It autodetects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.
Vendor-specific media behavior is registered by vendor plugins, while OpenClaw
core owns the shared `tools.media` config, fallback order, and reply-pipeline
integration.
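As a rough illustration of that split, the shared config core owns might look like the sketch below. The shape is hypothetical; the real `tools.media` schema lives in OpenClaw core and may differ.

```typescript
// Hypothetical shape for the shared `tools.media` config owned by core.
// Field names here are illustrative, not OpenClaw's documented schema.
const toolsMediaConfig = {
  enabled: true,                   // understanding can be disabled entirely
  providers: ["google", "openai"], // fallback order, tried in turn by core
};
```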
## Goals
- Optional: predigest inbound media into short text for faster routing + better command parsing.
@ -184,7 +188,10 @@ If you set `capabilities`, the entry only runs for those media types. For shared
lists, OpenClaw can infer defaults:
- `openai`, `anthropic`, `minimax`: **image**
- `moonshot`: **image + video**
- `google` (Gemini API): **image + audio + video**
- `mistral`: **audio**
- `zai`: **image**
- `groq`: **audio**
- `deepgram`: **audio**
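The inference rule above can be sketched as a small lookup: explicit `capabilities` always wins, otherwise the vendor defaults apply. This helper is illustrative only; the real inference lives inside OpenClaw core.

```typescript
// Illustrative sketch of capability inference, mirroring the defaults
// documented above. Not OpenClaw's real implementation.
type MediaCapability = "image" | "audio" | "video";

const DEFAULT_CAPABILITIES: Record<string, MediaCapability[]> = {
  openai: ["image"],
  anthropic: ["image"],
  minimax: ["image"],
  moonshot: ["image", "video"],
  google: ["image", "audio", "video"],
  mistral: ["audio"],
  zai: ["image"],
  groq: ["audio"],
  deepgram: ["audio"],
};

function inferCapabilities(
  entryId: string,
  explicit?: MediaCapability[],
): MediaCapability[] {
  // Explicit `capabilities` always wins; otherwise fall back to the
  // vendor defaults; otherwise infer nothing.
  return explicit ?? DEFAULT_CAPABILITIES[entryId] ?? [];
}
```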
@ -193,11 +200,11 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
## Provider support matrix (OpenClaw integrations)
| Capability | Provider integration | Notes |
| ---------- | -------------------------------------------------- | ----------------------------------------------------------------------- |
| Image | OpenAI, Anthropic, Google, MiniMax, Moonshot, Z.AI | Vendor plugins register image support against core media understanding. |
| Audio | OpenAI, Groq, Deepgram, Google, Mistral | Provider transcription (Whisper/Deepgram/Gemini/Voxtral). |
| Video | Google, Moonshot | Provider video understanding via vendor plugins. |
## Model selection guidance

View File

@ -113,9 +113,11 @@ That means:
Examples:
- the bundled `openai` plugin owns OpenAI model-provider behavior and OpenAI
speech + media-understanding behavior
- the bundled `elevenlabs` plugin owns ElevenLabs speech behavior
- the bundled `microsoft` plugin owns Microsoft speech behavior
- the bundled `google`, `minimax`, `mistral`, `moonshot`, and `zai` plugins own
their media-understanding backends
- the `voice-call` plugin is a feature plugin: it owns call transport, tools,
CLI, routes, and runtime, but it consumes core TTS/STT capability instead of
inventing a second speech stack
@ -167,17 +169,23 @@ For example, TTS follows this shape:
That same pattern should be preferred for future capabilities.
### Capability example: video understanding
OpenClaw already treats image/audio/video understanding as one shared
capability. The same ownership model applies there:
1. core defines the media-understanding contract
2. vendor plugins register `describeImage`, `transcribeAudio`, and
`describeVideo` as applicable
3. channels and feature plugins consume the shared core behavior instead of
wiring directly to vendor code
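The three steps above can be sketched as follows. All names here are illustrative stand-ins, not OpenClaw's real API: core owns the contract and dispatch/fallback, the vendor object only supplies implementations, and consumers call the shared entry point.

```typescript
// Illustrative sketch of the ownership split; names are hypothetical.
type MediaCapability = "image" | "audio" | "video";

interface MediaRequest { url: string }
interface MediaResult { text: string }

// 1. Core defines the media-understanding contract.
interface MediaUnderstandingProvider {
  id: string;
  capabilities: MediaCapability[];
  describeImage?: (req: MediaRequest) => Promise<MediaResult>;
  transcribeAudio?: (req: MediaRequest) => Promise<MediaResult>;
  describeVideo?: (req: MediaRequest) => Promise<MediaResult>;
}

const providers: MediaUnderstandingProvider[] = [];

// 3. Core-side dispatch consumed by channels/feature plugins: use the
// first registered provider declaring the capability (a stand-in for
// real fallback behavior).
async function describeVideo(req: MediaRequest): Promise<MediaResult> {
  for (const p of providers) {
    if (p.capabilities.includes("video") && p.describeVideo) {
      return p.describeVideo(req);
    }
  }
  throw new Error("no video-capable provider registered");
}

// 2. Vendor plugin side: registers implementations, never dispatch logic.
providers.push({
  id: "google",
  capabilities: ["image", "audio", "video"],
  describeVideo: async () => ({ text: "a short clip summary" }),
});
```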
That avoids baking one provider's video assumptions into core. The plugin owns
the vendor surface; core owns the capability contract and fallback behavior.
If OpenClaw adds a new domain later, such as video generation, use the same
sequence again: define the core capability first, then let vendor plugins
register implementations against it.
## Compatible bundles
@ -717,6 +725,28 @@ Notes:
text, speech, image, and future media providers as OpenClaw adds those
capability contracts.
For image/audio/video understanding, plugins register one typed
media-understanding provider instead of a generic key/value bag:
```ts
api.registerMediaUnderstandingProvider({
  id: "google",
  capabilities: ["image", "audio", "video"],
  describeImage: async (req) => ({ text: "..." }),
  transcribeAudio: async (req) => ({ text: "..." }),
  describeVideo: async (req) => ({ text: "..." }),
});
```
Notes:
- Keep orchestration, fallback, config, and channel wiring in core.
- Keep vendor behavior in the provider plugin.
- Additive expansion should stay typed: new optional methods, new optional
result fields, new optional capabilities.
- If OpenClaw adds a new capability such as video generation later, define the
core capability contract first, then let vendor plugins register against it.
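"Additive expansion should stay typed" can be made concrete with a small type-level sketch. The interfaces and the `generateVideo` method below are hypothetical, not OpenClaw's real types: widening a provider type with only optional members keeps existing registrations valid without edits.

```typescript
// Hypothetical before/after provider types illustrating additive,
// typed expansion. Not OpenClaw's real interfaces.
interface MediaProviderV1 {
  id: string;
  describeImage?: (url: string) => Promise<{ text: string }>;
}

interface MediaProviderV2 extends MediaProviderV1 {
  // New optional method with a new optional result field: every V1
  // provider is still a valid V2 provider, unchanged.
  generateVideo?: (prompt: string) => Promise<{ text: string; videoUrl?: string }>;
}

const legacy: MediaProviderV1 = { id: "openai" };
const stillValid: MediaProviderV2 = legacy; // compiles: expansion was additive
```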
For STT/transcription, plugins can call:
```ts
@ -1294,6 +1324,7 @@ Plugins export either:
- `registerChannel`
- `registerProvider`
- `registerSpeechProvider`
- `registerMediaUnderstandingProvider`
- `registerWebSearchProvider`
- `registerHttpRoute`
- `registerCommand`