The whisperSpeechToText module for Wowza Streaming Engine™ media server software can be used to receive audio from an incoming source stream and send that raw audio to OpenAI Whisper. Whisper's speech recognition service processes the audio data and returns captions for display alongside your live stream. For available models and languages, see the whisper project on GitHub.
The module automatically enables captions for WebVTT output, which we generally recommend, but it can also be configured for CEA-608/708 captions. When used with the Whisper service, the module can transcribe audio into captions and can also translate the source audio into different language tracks.
Translations are accomplished using the LibreTranslate open-source machine translation API, which allows you to translate text between languages without relying on proprietary cloud services. The project is written in Python and built on Argos Translate. For the list of supported languages, see the LibreTranslate documentation.
You can get the whisperSpeechToText source code from the wse-plugin-caption-handlers repository on GitHub.
Prerequisites
To work with the whisperSpeechToText module, you must meet the following prerequisites:
- You must have Wowza Streaming Engine 4.9.4 or later installed and use Java 21.
- If you plan to preview the module using Docker Compose, install and run Docker Desktop.
- If you're not using Docker Compose to preview the module, you need to manually set up your own Whisper server.
Usage
You can preview the whisperSpeechToText module using our Docker Compose deployment, or you can manually install the module in your existing Wowza Streaming Engine installation. Select one of the following workflows depending on your use case:
A successful setup uses Whisper's automatic speech recognition (ASR) system to convert audio from a source stream into text, which the module then injects into the Wowza Streaming Engine live stream as onTextData. Once the onTextData is inserted into the stream, you can configure Wowza Streaming Engine to output CEA-608/708 or WebVTT captions.
For most modern use cases, we recommend using WebVTT captions, as they offer rich styling and customization options, full UTF-8 encoding for internationalization, and native support in multiple browsers and players.
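For reference, a WebVTT caption segment generated from transcribed speech looks roughly like the following sketch (the timestamp map and cue text here are illustrative, not actual module output):

```text
WEBVTT
X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000

00:00:01.000 --> 00:00:04.200
Welcome back to the show.

00:00:04.200 --> 00:00:07.500
Today we're talking about live captions.
```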
Preview the module with Docker Compose
To preview this module, you can use our docker-compose.yaml deployment. We describe a similar process in the Set up Wowza Streaming Engine using a Docker Compose deployment article, where you can find additional information about environment variables.
This Docker Compose workflow is pre-configured to start a Wowza Streaming Engine instance with the whisperSpeechToText module installed and set up to leverage Whisper's ASR services. It also installs and sets up a Whisper server that automatically detects the language of the input audio and transcribes it into WebVTT captions.
Note: If you're trying to manually add the module to an existing installation of Wowza Streaming Engine, continue with the Install the module section instead. When manually installing the module, you have to set up your own Whisper server.
To use the Docker Compose preview deployment, follow these steps.
- Install Docker Desktop, which includes the Docker Engine and the Docker Compose plugin.
- Make sure Docker Desktop and Docker Engine are running.
- Clone the wse-plugin-caption-handlers repo:
git clone git@github.com:WowzaMediaSystems/wse-plugin-caption-handlers.git
- Change to the wse-plugin-caption-handlers repo directory:
cd wse-plugin-caption-handlers
- Set the WSE_LICENSE_KEY environment variable, which the local docker-compose.yaml file references, to your Wowza Streaming Engine license key:
export WSE_LICENSE_KEY=[your-license-key]
Note: If you set the license key using this method, it doesn't persist between terminal sessions; you must set it again each time you run the Docker container or reboot your server. For a more consistent experience, you can add the license key directly to the docker-compose.yaml file or use a .env file to store sensitive data.
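As a sketch of the .env approach mentioned in the note above: Docker Compose automatically reads a .env file from the project directory, so the key survives reboots and new terminal sessions (the value shown is the same placeholder used in these steps):

```text
# .env file in the wse-plugin-caption-handlers directory
WSE_LICENSE_KEY=[your-license-key]
```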
- (Optional) To enable translations via the LibreTranslate translation service:
- In the local docker-compose.yaml file, uncomment these lines for the libretranslate_server service:
libretranslate_server:
  hostname: libretranslate.server
  image: libretranslate/libretranslate:latest
  # image: libretranslate/libretranslate:latest-cuda
  environment:
    - LT_LOAD_ONLY=en,fr,es,de,ja
  ports:
    - 5001:5000
  volumes:
    - /tmp/libretranslate_models_cache:/home/libretranslate/.local
- In the local docker-compose.yaml file, uncomment and update the following environment variables for the whisper_server service:
- SOURCE_LANGUAGE=en
- REPORT_LANGUAGES=en,fr,es,de,ja
- TRANSLATE_HOST=libretranslate.server
- TRANSLATE_PORT=5000
Notes:
- The SOURCE_LANGUAGE environment variable specifies the language of the source text or audio to be translated. Use a valid ISO 639-1 two-letter language code to specify the language.
- The REPORT_LANGUAGES environment variable defines the target languages to be used for translation. To ensure consistency between loaded models and reporting output, it must include the same language codes specified in the LT_LOAD_ONLY environment variable. Use a valid ISO 639-1 two-letter language code to specify the language.
- The libretranslate/libretranslate:latest-cuda image can be used if you're planning to enable GPU acceleration with translations.
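The consistency requirement above (every code in REPORT_LANGUAGES must also appear in LT_LOAD_ONLY) can be sanity-checked with a few lines of Python; this is an illustrative sketch, not part of the module:

```python
def languages_consistent(lt_load_only: str, report_languages: str) -> bool:
    """Check that every REPORT_LANGUAGES code is also loaded via LT_LOAD_ONLY."""
    loaded = {code.strip() for code in lt_load_only.split(",")}
    requested = {code.strip() for code in report_languages.split(",")}
    return requested <= loaded  # subset check

print(languages_consistent("en,fr,es,de,ja", "en,fr,es,de,ja"))  # True
print(languages_consistent("en,fr,es,de,ja", "en,fr,it"))        # False: 'it' is not loaded
```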
- In the local Application.xml file, uncomment and update the following lines, making sure the required language codes are included:
<TimedText>
    <!-- Properties for TimedText -->
    <Properties>
        <Property>
            <Name>captionLiveIngestLanguages</Name>
            <Value>en,fr,es,de,ja</Value>
        </Property>
    </Properties>
</TimedText>
- (Optional) To enable GPU processing and take advantage of hardware acceleration:
- In the local docker-compose.yaml file, uncomment and update these lines for the whisper_server service:
whisper_server:
  hostname: whisper.server
  image: wowza/whisper_streaming:latest-gpu
  runtime: nvidia
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]
            count: 1
Note: The wowza/whisper_streaming:latest-gpu image on Docker Hub is preconfigured and optimized for GPU acceleration. It includes all necessary settings and dependencies to enable GPU support out of the box.
- In the local docker-compose.yaml file, uncomment and update the following environment variables for the whisper_server service:
- MODEL=large-v3-turbo
- USE_GPU=true
- FP16=true
Notes:
- We recommend using the large-v3-turbo Whisper model when enabling GPU processing.
- The USE_GPU boolean enables GPU acceleration for faster inference.
- The FP16 variable enables 16-bit floating-point precision for faster processing; it applies only on GPUs.
- From your local wse-plugin-caption-handlers repo, run:
docker compose up
- Open a new browser tab and go to:
http://localhost:8088/login.htm?host=http://wse.docker:8087
Note: When you click the Server link, confirm the http://wse.docker:8087 URL displays.
- Log in to Wowza Streaming Engine using the credentials from the docker-compose.yaml file.
- Go to Applications and click the whisper application.
- Check the Modules tab for the whisper application, which includes the whisperSpeechToText module.
- Go to the Properties tab and view the Custom properties. They are pre-configured to work with the Whisper ASR service.
- Start a stream and send it to your Wowza Streaming Engine server using the following server and stream key. For more about publishing live streams, see Connect a live source to Wowza Streaming Engine.
rtmp://localhost:1935/whisper/myStream
- To test playback and see the automatically generated WebVTT captions, go to our Wowza Test Player and use this URL:
http://localhost:1935/whisper/myStream_delayed/playlist.m3u8
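If you want to verify programmatically that the HLS output advertises a WebVTT subtitle rendition, you can fetch the master playlist and look for an #EXT-X-MEDIA tag with TYPE=SUBTITLES. A minimal sketch (the sample playlist excerpt is illustrative, not actual server output):

```python
def has_webvtt_subtitles(master_playlist: str) -> bool:
    """Return True if any #EXT-X-MEDIA line advertises a SUBTITLES rendition."""
    return any(
        line.startswith("#EXT-X-MEDIA:") and "TYPE=SUBTITLES" in line
        for line in master_playlist.splitlines()
    )

# Illustrative excerpt of a master playlist that includes a subtitle rendition
sample = """#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",LANGUAGE="en",URI="subs/playlist.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=365000,SUBTITLES="subs"
chunklist.m3u8
"""
print(has_webvtt_subtitles(sample))  # True
```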
Set up the module without Docker Compose
If you already have Wowza Streaming Engine installed and don't plan to use the Docker Compose deployment to preview the pre-configured whisperSpeechToText module, you can install and configure the standalone module using the steps in this section.
Install the module
To manually install the standalone module without using our Dockerized solution, follow these steps.
- Download the wse-plugin-caption-handlers-[version].jar file from the latest plugin release version.
- Copy the wse-plugin-caption-handlers-[version].jar file to the [install-dir]/lib folder in your Wowza Streaming Engine installation.
- Enable the Wowza Streaming Engine Transcoder for your live application.
Notes:
- The Transcoder must be enabled to resample the audio and send it in a specific format to the Whisper service.
- To bypass transcoding of the final output, set the Fallback Template to None and remove any named templates matching the stream name.
- Stream name groups do not provide the proper output for captions. When using transcoded streams with captions, we recommend creating a Synchronized Multimedia Integration Language (SMIL) file instead. For more, see Understanding SMIL file syntax, Play live streams with WebVTT subtitles, and Create a SMIL file for live streaming using a text editor. You can also reference the following sample SMIL file, which outputs WebVTT subtitles with your video tracks.
<?xml version="1.0" encoding="UTF-8"?>
<smil title="SMIL file for live streaming">
    <head></head>
    <body>
        <switch>
            <video src="myStream_160p" width="284" height="160">
                <param name="videoBitrate" value="105000" valuetype="data"/>
                <param name="audioBitrate" value="44100" valuetype="data"/>
                <param name="cupertinoTag.SUBTITLES" value="subs" valuetype="data"/>
            </video>
            <video src="myStream_360p" width="640" height="360">
                <param name="videoBitrate" value="365000" valuetype="data"/>
                <param name="audioBitrate" value="44100" valuetype="data"/>
                <param name="cupertinoTag.SUBTITLES" value="subs" valuetype="data"/>
            </video>
            <!-- Add caption data -->
            <textstream src="myStream_360p" system-language="eng,kor">
                <param name="iswowzacaptionstream" value="true" valuetype="data"/>
                <param name="cupertinoTag.TYPE" value="SUBTITLES" valuetype="data"/>
                <param name="cupertinoTag.GROUP-ID" value="subs" valuetype="data"/>
                <param name="cupertinoTag.DEFAULT" value="YES" valuetype="data"/>
                <param name="cupertinoTag.FORCED" value="NO" valuetype="data"/>
            </textstream>
        </switch>
    </body>
</smil>
- Restart Wowza Streaming Engine.
- Continue to the Enable the module and Configure module properties sections.
Enable the module
To enable this module, add the following module definition to your application configuration. See Configure modules for details.
| Name | Description | Fully qualified class name |
|------|-------------|----------------------------|
| whisperSpeechToText | WhisperSpeechToText | com.wowza.wms.plugin.captions.ModuleWhisperCaptions |
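In XML terms, the module definition above corresponds to a <Module> entry in the <Modules> list of your live application's Application.xml, roughly like this sketch:

```xml
<Modules>
    <!-- ...existing module definitions... -->
    <Module>
        <Name>whisperSpeechToText</Name>
        <Description>WhisperSpeechToText</Description>
        <Class>com.wowza.wms.plugin.captions.ModuleWhisperCaptions</Class>
    </Module>
</Modules>
```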
Configure module properties
After enabling the module, you can adjust the default settings by adding the following Custom properties to your live application. See Configure properties for details.
Required properties
| Path | Name | Type | Value | Description |
|------|------|------|-------|-------------|
| /Root/Application | whisperCaptionsEnabled | Boolean | true | If the whisperSpeechToText module is configured, set this property to enable it. The default value is false. |
| /Root/Application | whisperSocketHost | String | localhost | Specify the hostname or IP address where the Whisper service is hosted. |
| /Root/Application | whisperSocketPort | String | 3000 | Specify the network port on which the Whisper server is actively listening for incoming connections. |
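If you prefer editing Application.xml directly instead of using Wowza Streaming Engine Manager, the required properties map to <Property> entries under the application's <Properties> container, as in this sketch (values match the defaults in the table above):

```xml
<Properties>
    <Property>
        <Name>whisperCaptionsEnabled</Name>
        <Value>true</Value>
        <Type>Boolean</Type>
    </Property>
    <Property>
        <Name>whisperSocketHost</Name>
        <Value>localhost</Value>
        <Type>String</Type>
    </Property>
    <Property>
        <Name>whisperSocketPort</Name>
        <Value>3000</Value>
        <Type>String</Type>
    </Property>
</Properties>
```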
Optional properties
Path | Name | Type | Value | Description |
/Root/Application | captionHandlerDebug | Boolean | true | Enables extra debug logging for troubleshooting. |
/Root/Application | captionHandlerStreamDelay | String | 10000 | Defines the delay between the source stream and output stream in milliseconds. The default value is 30000 (or 30 seconds). |
Configure captioning properties
The whisperSpeechToText module enables WebVTT captions and defaults to the detected language. If you plan to use embedded captions, such as CEA-608/708, you have to disable the captionLiveIngestLanguages closed-captioning property.
- From the Properties tab of your Wowza Streaming Engine live application, click Closed Captions.
- Click Edit.
- Disable the captionLiveIngestLanguages property.
- Click Save.
- Restart your application.
- See Configure closed captioning for Wowza Streaming Engine live streams for more information.
Set up a Whisper server
If you're not using the Docker workflow to preview this module, you must independently set up a Whisper server to process audio data and return captions. We provide the whisper_streaming GitHub repository, which includes a Docker container for running a standalone Whisper service and builds on the open-source whisper_streaming project. To run the Whisper server, see the following sections.
Without GPU support
To run a Whisper server without GPU support, follow these steps.
- Install Docker Desktop, which includes the Docker Engine and the Docker Compose plugin.
- Make sure Docker Desktop and Docker Engine are running.
- Clone the whisper_streaming repo:
git clone git@github.com:WowzaMediaSystems/whisper_streaming.git
- Change to the whisper_streaming repo directory:
cd whisper_streaming
- Ensure that the image for the whisper_server service in the local docker-compose.yaml file is set to wowza/whisper_streaming:latest:
name: Whisper Streaming
services:
  whisper_server:
    hostname: whisper.server
    image: wowza/whisper_streaming:latest
- From your local whisper_streaming repo, run:
docker compose up
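Once the container is up, the Whisper service should be reachable on the port your application is configured to use (3000 by default, per the whisperSocketPort property). One quick way to probe connectivity is a short Python sketch; the host and port here are assumptions based on those defaults:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Attempt a TCP connection; True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the default Whisper socket used by the module properties
print(port_open("localhost", 3000))
```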
With GPU support (optional)
To run a Whisper server and enable GPU support, follow these steps.
- Install Docker Desktop, which includes the Docker Engine and the Docker Compose plugin.
- Make sure Docker Desktop and Docker Engine are running.
- Clone the whisper_streaming repo:
git clone git@github.com:WowzaMediaSystems/whisper_streaming.git
- Change to the whisper_streaming repo directory:
cd whisper_streaming
- In the local docker-compose.yaml file, uncomment and update the following lines for the whisper_server service. Ensure that the image for the whisper_server service is set to wowza/whisper_streaming:latest-gpu:
name: Whisper Streaming
services:
  whisper_server:
    hostname: whisper.server
    image: wowza/whisper_streaming:latest-gpu
    build:
      context: .
      # dockerfile: Dockerfile.jetson
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              count: 1
Notes:
- The wowza/whisper_streaming:latest-gpu image on Docker Hub is preconfigured and optimized for GPU acceleration. It includes all necessary settings and dependencies to enable GPU support out of the box.
- If you're not using the preconfigured wowza/whisper_streaming:latest-gpu image and are instead building the image locally, make sure to update this Dockerfile before building. You'll need to uncomment or modify the following lines to enable GPU support so the image is configured correctly for your setup.
# Install these for GPU, increases image size by ~5GB
RUN pip install torch
RUN pip install "triton>=2.0.0; platform_machine=='x86_64' and (sys_platform=='linux' or sys_platform=='linux2')"
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
RUN dpkg -i cuda-keyring_1.1-1_all.deb
RUN apt update && apt install cudnn9-cuda-12 -y
- In the local docker-compose.yaml file, uncomment and update the following environment variables for the whisper_server service:
- MODEL=large-v3-turbo
- USE_GPU=true
- FP16=true
Notes:
- We recommend using the large-v3-turbo Whisper model when enabling GPU processing.
- The USE_GPU boolean enables GPU acceleration for faster inference.
- The FP16 variable enables 16-bit floating-point precision for faster processing; it applies only on GPUs.
- From your local whisper_streaming repo, run:
docker compose up
- Verify GPU usage with the following command. You should see the Python process consuming GPU memory and compute resources:
watch -n 1 nvidia-smi
Test playback
Use the steps in this section to publish your source stream to Wowza Streaming Engine and to verify that the module is working as expected.
- Start a stream and send it to your Wowza Streaming Engine server using the following server/port and stream key. For more about publishing live streams, see Connect a live source to Wowza Streaming Engine.
rtmp://[server-ip-address]:1935/[application-name]/myStream
- Check the Incoming Streams page for your live stream, where the output looks similar to this:
Note: The [stream-name]_160p, [stream-name]_360p, and [stream-name]_source renditions include WebVTT captions. They're transcoded versions of the [stream-name]_delayed stream.
- Go to our Wowza Test Player to test playback with the automatically generated WebVTT captions using the following URL:
http://[server-ip-address]:[port]/[application-name]/myStream_delayed/playlist.m3u8
If using a SMIL file:
http://[server-ip-address]:[port]/[application-name]/smil:myStream.smil/playlist.m3u8