Generating and Delivering Captions with Wowza Streaming Engine

Learn how to use Wowza’s Caption Handlers Module with tools like OpenAI and Azure to transcribe and deliver subtitles and captions

In live streaming, delivering accurate captions or subtitles is increasingly crucial for accessibility (e.g. for deaf or hard of hearing users), reach (multilingual or non-native speaker audiences), search and discoverability, and regulatory compliance. Wowza’s Caption Handlers plugin (including the whisperSpeechToText module) addresses exactly this: turning the audio track of a live stream into real-time transcription using Automatic Speech Recognition (ASR) engines.

Recently, a community member asked about using Azure vs OpenAI Whisper with the module. This sparked good discussion, and it’s worth summarizing where things stand, what the trade‑offs are, and what benefits you can expect. So let’s answer the question:

“Can I point the module to the public Whisper API from OpenAI, or do I need a local Whisper server?”

Using the Wowza Caption Handlers Module

First let’s set the stage and provide some context. For those unfamiliar, here’s a breakdown of how the module works, its key components, and what it supports:

  • The Caption Handlers module (wse-plugin-caption-handlers) is a plugin that allows live streams to have captions generated in real time using ASR.
    • Built for Wowza Streaming Engine version 4.9.4+
    • Requires Java 21
    • Supports two ASR options out of the box:
      • Azure Speech‑to‑Text/Azure AI Speech Services
        • Uses Microsoft’s cloud speech recognition
      • Open-Source Whisper/Whisper ASR system
        • Deployed on a locally hosted Whisper server or via the Docker Compose preview setup that Wowza provides
    • Outputs subtitles/cues into the stream
      • Uses onTextData events
      • Can be formatted as WebVTT or CEA‑608/708 caption formats
    • Ready‑to‑use Docker Compose deployment
      • Includes a preconfigured Whisper server + Wowza with the module installed
      • Spin up a test/preview environment
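
To make the WebVTT output mentioned above more concrete, here is a minimal Python sketch of how timed transcription segments could be rendered as WebVTT cues. It is an illustration only, not the module's actual implementation: the `Cue` class and the `format_timestamp` and `to_webvtt` helpers are hypothetical names, and the sketch does not escape special characters in cue text.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption cue (hypothetical structure, for illustration only)."""
    start: float  # seconds from the start of the stream
    end: float    # seconds from the start of the stream
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues: list[Cue]) -> str:
    """Build a WebVTT document: header, blank line, then one block per cue."""
    lines = ["WEBVTT", ""]
    for cue in cues:
        lines.append(f"{format_timestamp(cue.start)} --> {format_timestamp(cue.end)}")
        lines.append(cue.text.strip())
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)
```

Calling `to_webvtt([Cue(0.0, 2.5, "Hello")])` produces a `WEBVTT` header followed by a single cue spanning the first 2.5 seconds.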

Why The Public Whisper API Isn’t Fully Supported Yet

A key thing to note, from the Wowza Community thread, is that there may be some compatibility issues depending on how these modules are deployed and what ASR systems a user chooses to leverage. Notably, the Caption Handlers module expects a locally-hosted Whisper engine, and therefore will not work with the public OpenAI Whisper API at present. To work around this, you can either:

  • Use the provided Docker Compose setup from the GitHub repository, which includes a working Whisper server.
  • Host your own Whisper server (via the open-source Whisper codebase) and configure the module to point to it.

This is because the module expects “a local server that runs the Whisper engine” rather than the public hosted API (Wowza Community). The public OpenAI Whisper API likely differs in protocol, authentication, rate limits, and response format. The module is built to communicate with a local interface that you control, and the public API doesn’t match those specs out of the box. There may also be concerns about latency, reliability, or cost when streaming audio chunks continuously to a remote API; local hosting allows tighter control.

Unless the module is updated (or you write a custom module to “wrap” the public API), using the public Whisper API isn’t a turnkey option with this module right now. That said, we are always looking to improve and refine these open-source projects. If you have a solution or proposed improvement, simply create a branch on our GitHub.

The Benefits of Wowza’s Captioning & Subtitling Tools

Real-time transcriptions improve accessibility for people with hearing loss, non‑native speakers, noisy environments, etc. They also support regulatory compliance in many jurisdictions. Plus, captions enable improved search/indexing of content, and usage in contexts where audio may be turned off (e.g., social media, mobile usage). Read more about the benefits of captions in our blog post here.

Here are the key advantages of using Wowza to generate live captions and subtitles:

  1. Real‑Time Captions with Modern Formats:
    Streaming captions in WebVTT or CEA‑608/708 formats allows broad compatibility across devices and players. WebVTT enables internationalization, is recommended for modern browsers/players, and supports UTF‑8 with more flexible styling. (Wowza)
  2. Control & Customization:
    Hosting your own Whisper server allows you to control model version, latency, and compute resources, while customizing or tuning for specific situations (e.g. technical, medical, or accented speech).
  3. Reduced Dependence When Self‑Hosting:
    If you self‑host, you can avoid network latency, service unavailability, data privacy concerns, or escalating API costs. Whether hosting fully on-prem or in a hybrid approach, this provides added flexibility and cost control.
  4. Dual Backend Flexibility:
    With support for both Azure and Whisper, you can pick the most suitable backend ASR system depending on your needs (cost, latency, accuracy, regulations, supported language coverage, etc.). Azure may offer strong support and enterprise SLAs, while Whisper can offer lower costs (if self‑hosted) and more freedom.
  5. Out-of-the-Box Functionality:
    Wowza provides the Caption Handlers module with flexible support to integrate with either Azure or Whisper as needed. With this capability, you don’t need to build any audio ingest, speech-to-text conversion, subtitle injection, or output formatting.

Limitations and Trade‑offs with AI Caption Generation

There are significant benefits to using these tools, but no solution is perfect. It is important to take note of limitations with ASR systems in generating subtitles or captions. While these tools offer speed, they may not be perfectly accurate or reliable. These are not limitations specific to Wowza, but more so of the ASR workflow itself:

  • Latency/Delay:
    There will always be latency introduced by capturing audio, sending it to the ASR engine, processing, returning text, formatting, and injecting it into the stream. For fast‑paced content (news, sports, live discussions), even small delays matter.
  • Low-Fidelity Audio and Special Words:
    Background noise, accents, overlapping speech, domain‑specific vocabulary (technical, medical, slang), three-letter acronyms, or low audio quality can degrade performance. Also, Whisper may have higher error rates compared to specialized or more heavily tuned speech services.
  • AI Hallucinations/Errors:
    On the note of accuracy, ASR systems (including Whisper) sometimes produce “hallucinated” text: words or phrases that were never actually spoken. It’s important to have a fallback or validation step, especially in contexts where accuracy is critical. Some studies report that a small percentage of transcriptions contain severe hallucinations. (arXiv)
  • Self-Hosting Resource Requirements:
    If you run Whisper locally, you need to allocate compute resources, ensure the server is stable, manage scaling, monitoring, etc. Model size matters; larger Whisper models give better accuracy but require more resources.
  • Cost vs Operational Overhead:
    Using Azure or any paid ASR service carries ongoing usage costs, while self-hosting carries hardware/infrastructure and maintenance costs. You have to balance those against your needs and budget, while considering the implications as you scale.
  • Potential Legal/Privacy Concerns:
    If your content includes sensitive or private speech, ensuring data handling meets legal/privacy regulations is crucial. Self‑hosting helps if the environment is secure.
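
As one possible form of the validation step mentioned under hallucinations, the Python sketch below flags segments for human review based on per-segment statistics. It assumes your Whisper server exposes the `avg_logprob`, `no_speech_prob`, and `compression_ratio` fields that the open-source Whisper library attaches to each segment; whether the Caption Handlers module surfaces these values is an assumption, not something confirmed here. The thresholds mirror the defaults of open-source Whisper's `transcribe` function and should be treated as starting points, and `review_reasons` is a hypothetical helper name.

```python
# Thresholds mirror the defaults of open-source Whisper's transcribe();
# tune them against your own content before relying on them.
LOGPROB_THRESHOLD = -1.0
NO_SPEECH_THRESHOLD = 0.6
COMPRESSION_RATIO_THRESHOLD = 2.4


def review_reasons(segment: dict) -> list[str]:
    """Return the reasons a segment looks unreliable (empty list if none).

    `segment` is assumed to be a dict shaped like an open-source Whisper
    segment; missing fields are treated as "no evidence of a problem".
    """
    reasons = []
    if segment.get("avg_logprob", 0.0) < LOGPROB_THRESHOLD:
        reasons.append("low confidence")
    if segment.get("no_speech_prob", 0.0) > NO_SPEECH_THRESHOLD:
        reasons.append("possible silence")
    if segment.get("compression_ratio", 0.0) > COMPRESSION_RATIO_THRESHOLD:
        reasons.append("repetitive text")
    return reasons
```

A segment that returns a non-empty list could be held back, marked in a review queue, or shown with a visual indicator, depending on how critical accuracy is for your stream.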

Wowza Caption Handler Plugin Sample Workflows

If you’re considering implementing this, there are two “typical” setups, and key configuration points, that Wowza recommends. Of course, Wowza provides complete flexibility and control to build your system however you need it to work. These tested workflows can provide the out-of-the-box functionality you need or serve as a foundation to build advanced capabilities on top of.

Scenario 1: Self‑Hosted Whisper + Wowza

  • Components: Wowza Streaming Engine → module → local Whisper server (Docker or bare‑metal) → Subtitle output (WebVTT/CEA‑608/708)
  • Pros: Full control; no external dependency; lower per‑minute cost; possible better data privacy; flexible model/versioning
  • Cons: Must manage and monitor the Whisper server; hardware requirements; scaling; latency management; updates; possibly more setup complexity

Scenario 2: Azure Speech‑to‑Text + Wowza

  • Components: Wowza module → Azure Speech API → Captions output
  • Pros: SLA from Azure; likely good performance; managed service; possibly wider support for languages, accents; less server maintenance
  • Cons: Ongoing costs; network latency; dependencies on external service; privacy/compliance constraints depending on data jurisdiction; less control over model internals

Key configuration notes:

  • Ensure Wowza version 4.9.4 or greater and Java 21
  • For Whisper, point the module to a properly configured server (host, port, correct API format)
  • For Azure, set up your subscription, keys, region, etc.
  • Configure delay/buffering parameters to balance latency vs stability/caption completeness
  • Choose output format: WebVTT (recommended for modern use) vs CEA‑608/708 depending on playback devices
  • Monitor error rates, latency, etc., and possibly plan fallbacks or human review in critical scenarios
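
Before pointing the module at a Whisper server, it can help to confirm that the host and port are reachable from the machine running Wowza Streaming Engine. The Python sketch below is a generic TCP reachability check using only the standard library. It does not verify that the server speaks the API format the module expects, and `server_reachable` is a hypothetical helper name.

```python
import socket


def server_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `server_reachable("localhost", 9000)` checks a server on the local machine; the port number here is only a placeholder for whatever your deployment uses.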

Practical Use Cases

Here are some scenarios where using Wowza Caption Handlers (pointed toward either Azure or Whisper) shines:

  • Live events/broadcasts — conferences, lectures, church services — where captions are needed on video in near real‑time.
  • Accessibility for audiences — people with hearing disabilities; in noisy or muted environments.
  • Regulatory compliance — many jurisdictions or platforms require captions.
  • Multilingual or accented speaker support — Whisper’s multilingual capability helps. Azure also supports multiple languages / dialects.
  • Search, indexing, VOD content — even after live stream ends, the transcription can be used for subtitles, for content indexing, SEO, or generating transcripts.

Recommendations & Next Steps

If you want to implement or test this, here are our suggestions:

  1. Try the Docker Compose preview from the wse-plugin-caption-handlers repo since that gives you a nearly complete setup including a local Whisper server. It’s the fastest path to test. (Wowza)
  2. Benchmark both backends (Whisper local vs Azure) for your actual content – your speakers, accents, noise, domain vocabulary – to see what error rates, latency, and resource demands are.
  3. Keep an eye on updates from Wowza for plugin or module updates. Also watch for model updates from Whisper/OpenAI and Azure that may improve accuracy or latency trade‑offs.
  4. Handle error/fallback logic and consider human review, manual correction options, or flagging uncertain segments.
  5. Secure your data, especially if self‑hosting. Ensure encrypted connections, access control, privacy compliance, etc.
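
For the benchmarking step, a common way to compare backends is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the ASR output into a human-verified reference transcript, divided by the number of words in the reference. The Python sketch below is a minimal implementation under simple assumptions (lowercasing and whitespace tokenization only, with no punctuation handling); `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Running the same reference clips through both Whisper and Azure and comparing the resulting scores gives a like-for-like view of accuracy on your own speakers and vocabulary.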

Conclusion

Wowza’s Caption Handlers module is a powerful tool for adding real-time transcriptions to live streams, offering flexibility via Azure or Whisper depending on your needs. The key benefit is enhancing accessibility and viewer experience, while giving you full control over how the system works. The module helps streamline workflows, minimize latency, and provide greater reliability. With the option to use Azure or Whisper, you can balance cost, performance, control, and data privacy in the way that best fits your workflow.

At Wowza, we aim to provide the most reliable and flexible tools to enable your media operations. If you want to build on top of our Caption Handlers plugin, feel free to create a branch on our GitHub. We greatly appreciate the feedback and collaboration to build open-source projects everyone can draw value from.

Try Wowza for free today at https://www.wowza.com/free-trial

About Ian Zenoni

Ian Zenoni has been in the video industry for over 20 years and at Wowza for over 10. While at Wowza Ian has architected, built, and deployed solutions and services for live video streaming, both in the cloud and on premises. As Chief Architect Ian researches the latest technology in video streaming to integrate into Wowza’s products and services. He is also a co-organizer of the local Denver Video meetup group that meets quarterly in the Denver metro area.