The whisperSpeechToText module for Wowza Streaming Engine™ media server software can be used to receive audio from an incoming source stream and send that raw audio to OpenAI Whisper. Whisper's speech recognition service processes the audio data and returns captions for display alongside your live stream. For available models and languages, see the whisper project on GitHub.
The module automatically enables captions for WebVTT output, which we generally recommend, but it can also be configured for CEA-608/708 captions. When used with the Whisper service, the module can transcribe audio into captions and can also translate the source audio into different language tracks.
Translations are accomplished using the LibreTranslate open-source machine translation API, which allows you to translate text between languages without relying on proprietary cloud services. The project is written in Python and built on Argos Translate. For the list of supported languages, see the LibreTranslate documentation.
You can get the whisperSpeechToText source code from the wse-plugin-caption-handlers repository on GitHub.
Prerequisites
To work with the whisperSpeechToText module, you must meet the following prerequisites:
- You must have Wowza Streaming Engine 4.9.4 or later installed and use Java 21.
- If you plan to preview the module using Docker Compose, install and run Docker Desktop.
- If you're not using Docker Compose to preview the module, you need to manually set up your own Whisper server.
Usage
You can preview the whisperSpeechToText module using our Docker Compose deployment, or you can manually install the module in your existing Wowza Streaming Engine installation. Select one of the following workflows depending on your use case:
A successful setup uses Whisper's automatic speech recognition (ASR) system to convert audio from a source stream into text, which the module then injects into the Wowza Streaming Engine live stream as onTextData. Once the onTextData is inserted into the stream, you can configure Wowza Streaming Engine to output CEA-608/708 or WebVTT captions.
For most modern use cases, we recommend using WebVTT captions, as they offer rich styling and customization options, full UTF-8 encoding for internationalization, and native support in multiple browsers and players.
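For reference, a WebVTT caption segment generated from transcribed speech looks roughly like the following sketch (the timestamp map and cue text here are illustrative, not actual module output):

```text
WEBVTT
X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000

00:00:01.000 --> 00:00:04.200
Welcome back to the show.

00:00:04.200 --> 00:00:07.500
Today we're talking about live captions.
```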
Preview the module with Docker Compose
To preview this module, you can use our docker-compose.yaml deployment. We describe a similar process in the Set up Wowza Streaming Engine using a Docker Compose deployment article, where you can find additional information about environment variables.
This Docker Compose workflow is pre-configured to start a Wowza Streaming Engine instance with the whisperSpeechToText module installed and set up to leverage Whisper's ASR services. It also installs and sets up a Whisper server that automatically detects the language of the input audio and transcribes it into WebVTT captions.
Note: If you're trying to manually add the module to an existing installation of Wowza Streaming Engine, continue with the Install the module section instead. When manually installing the module, you have to set up your own Whisper server.
To use the Docker Compose preview deployment, follow these steps.
- Install Docker Desktop, which includes the Docker Engine and the Docker Compose plugin.
- Make sure Docker Desktop and Docker Engine are running.
- Clone the wse-plugin-caption-handlers repo:
git clone git@github.com:WowzaMediaSystems/wse-plugin-caption-handlers.git
- Change to the wse-plugin-caption-handlers repo directory:
cd wse-plugin-caption-handlers
- Set the WSE_LICENSE_KEY environment variable, which the local docker-compose.yaml file references, to your Wowza Streaming Engine license key:
export WSE_LICENSE_KEY=[your-license-key]
Note: If you set the license key using this method, it doesn't persist between terminal sessions; you must set it again each time you run the Docker container or reboot your server. For a more consistent experience, you can add the license key directly to the docker-compose.yaml file or use a .env file to store sensitive data.
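As a sketch of the .env approach mentioned in the note above: Docker Compose automatically reads a .env file from the project directory, so the key survives reboots and new terminal sessions (the value shown is the same placeholder used in these steps):

```text
# .env file in the wse-plugin-caption-handlers directory
WSE_LICENSE_KEY=[your-license-key]
```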
- (Optional) To enable translations via the LibreTranslate translation service:
- In the local docker-compose.yaml file, uncomment these lines for the libretranslate_server service:
libretranslate_server:
  hostname: libretranslate.server
  image: libretranslate/libretranslate:latest
  # image: libretranslate/libretranslate:latest-cuda
  environment:
    - LT_LOAD_ONLY=en,fr,es,de,ja
  ports:
    - 5001:5000
  volumes:
    - /tmp/libretranslate_models_cache:/home/libretranslate/.local
- In the local docker-compose.yaml file, uncomment and update the following environment variables for the whisper_server service:
- SOURCE_LANGUAGE=en
- REPORT_LANGUAGES=en,fr,es,de,ja
- TRANSLATE_HOST=libretranslate.server
- TRANSLATE_PORT=5000
Notes:
- The SOURCE_LANGUAGE environment variable specifies the language of the source text or audio to be translated. Use a valid ISO 639-1 two-letter language code to specify the language.
- The REPORT_LANGUAGES environment variable defines the target languages to be used for translation. To ensure consistency between loaded models and reporting output, it must include the same language codes specified in the LT_LOAD_ONLY environment variable. Use a valid ISO 639-1 two-letter language code to specify the language.
- The libretranslate/libretranslate:latest-cuda image can be used if you're planning to enable GPU acceleration with translations.
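The consistency requirement above (every code in REPORT_LANGUAGES must also appear in LT_LOAD_ONLY) can be sanity-checked with a few lines of Python; this is an illustrative sketch, not part of the module:

```python
def languages_consistent(lt_load_only: str, report_languages: str) -> bool:
    """Check that every REPORT_LANGUAGES code is also loaded via LT_LOAD_ONLY."""
    loaded = {code.strip() for code in lt_load_only.split(",")}
    requested = {code.strip() for code in report_languages.split(",")}
    return requested <= loaded  # subset check

print(languages_consistent("en,fr,es,de,ja", "en,fr,es,de,ja"))  # True
print(languages_consistent("en,fr,es,de,ja", "en,fr,it"))        # False: 'it' is not loaded
```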
- In the local Application.xml file, uncomment and update the following lines, making sure the required language codes are included:
<TimedText>
    <!-- Properties for TimedText -->
    <Properties>
        <Property>
            <Name>captionLiveIngestLanguages</Name>
            <Value>en,fr,es,de,ja</Value>
        </Property>
    </Properties>
</TimedText>
- (Optional) To enable GPU processing and take advantage of hardware acceleration:
- In the local docker-compose.yaml file, uncomment and update these lines for the whisper_server service:
whisper_server:
  hostname: whisper.server
  image: wowza/whisper_streaming:latest-gpu
  runtime: nvidia
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]
            count: 1
Note: The wowza/whisper_streaming:latest-gpu image on Docker Hub is preconfigured and optimized for GPU acceleration. It includes all necessary settings and dependencies to enable GPU support out of the box.
- In the local docker-compose.yaml file, uncomment and update the following environment variables for the whisper_server service:
- MODEL=large-v3-turbo
- USE_GPU=true
- FP16=true
Notes:
- We recommend using the large-v3-turbo Whisper model when enabling GPU processing.
- The USE_GPU boolean enables GPU acceleration for faster inference.
- The FP16 variable enables 16-bit floating-point precision for faster processing; it applies only on GPUs.
- From your local wse-plugin-caption-handlers repo, run:
docker compose up
- Open a new browser tab and go to:
http://localhost:8088/login.htm?host=http://wse.docker:8087
Note: When you click the Server link, confirm the http://wse.docker:8087 URL displays.
- Log in to Wowza Streaming Engine using the credentials from the docker-compose.yaml file.
- Go to Applications and click the whisper application.
- Check the Modules tab for the whisper application, which includes the whisperSpeechToText module.
- Go to the Properties tab and view the Custom properties. They are pre-configured to work with the Whisper ASR service.
- Start a stream and send it to your Wowza Streaming Engine server using the following server and stream key. For more about publishing live streams, see Connect a live source to Wowza Streaming Engine.
rtmp://localhost:1935/whisper/myStream
- To test playback and see the automatically generated WebVTT captions, go to our Wowza Test Player and use this URL:
http://localhost:1935/whisper/myStream_delayed/playlist.m3u8
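If you want to verify programmatically that the HLS output advertises a WebVTT subtitle rendition, you can fetch the master playlist and look for an #EXT-X-MEDIA tag with TYPE=SUBTITLES. A minimal sketch (the sample playlist excerpt is illustrative, not actual server output):

```python
def has_webvtt_subtitles(master_playlist: str) -> bool:
    """Return True if any #EXT-X-MEDIA line advertises a SUBTITLES rendition."""
    return any(
        line.startswith("#EXT-X-MEDIA:") and "TYPE=SUBTITLES" in line
        for line in master_playlist.splitlines()
    )

# Illustrative excerpt of a master playlist that includes a subtitle rendition
sample = """#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",LANGUAGE="en",URI="subs/playlist.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=365000,SUBTITLES="subs"
chunklist.m3u8
"""
print(has_webvtt_subtitles(sample))  # True
```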
Set up the module without Docker Compose
If you already have Wowza Streaming Engine installed and don't plan to use the Docker Compose deployment to preview the pre-configured whisperSpeechToText module, you can install and configure the standalone module using the steps in this section.
Install the module
To manually install the standalone module without using our Dockerized solution, follow these steps.
- Download the wse-plugin-caption-handlers-[version].jar file from the latest plugin release version.
- Copy the wse-plugin-caption-handlers-[version].jar file to the [install-dir]/lib folder in your Wowza Streaming Engine installation.
- Enable the Wowza Streaming Engine Transcoder for your live application.
Notes:
- The Transcoder must be enabled to resample the audio and send it in a specific format to the Whisper service.
- To bypass transcoding of the final output, set the Fallback Template to None and remove any named templates matching the stream name.
- Stream name groups do not provide the proper output for captions. When using transcoded streams with captions, we recommend creating a Synchronized Multimedia Integration Language (SMIL) file instead. For more, see Understanding SMIL file syntax, Play live streams with WebVTT subtitles, and Create a SMIL file for live streaming using a text editor. You can also reference the following sample SMIL file, which outputs WebVTT subtitles with your video tracks.
<?xml version="1.0" encoding="UTF-8"?>
<smil title="SMIL file for live streaming">
    <head></head>
    <body>
        <switch>
            <video src="myStream_160p" width="284" height="160">
                <param name="videoBitrate" value="105000" valuetype="data"/>
                <param name="audioBitrate" value="44100" valuetype="data"/>
                <param name="cupertinoTag.SUBTITLES" value="subs" valuetype="data"/>
            </video>
            <video src="myStream_360p" width="640" height="360">
                <param name="videoBitrate" value="365000" valuetype="data"/>
                <param name="audioBitrate" value="44100" valuetype="data"/>
                <param name="cupertinoTag.SUBTITLES" value="subs" valuetype="data"/>
            </video>
            <!-- Add caption data -->
            <textstream src="myStream_360p" system-language="eng,kor">
                <param name="iswowzacaptionstream" value="true" valuetype="data"/>
                <param name="cupertinoTag.TYPE" value="SUBTITLES" valuetype="data"/>
                <param name="cupertinoTag.GROUP-ID" value="subs" valuetype="data"/>
                <param name="cupertinoTag.DEFAULT" value="YES" valuetype="data"/>
                <param name="cupertinoTag.FORCED" value="NO" valuetype="data"/>
            </textstream>
        </switch>
    </body>
</smil>
- Restart Wowza Streaming Engine.
- Continue to the Enable the module and Configure module properties sections.
Enable the module
To enable this module, add the following module definition to your application configuration. See Configure modules for details.
| Name | Description | Fully qualified class name |
|------|-------------|----------------------------|
| whisperSpeechToText | WhisperSpeechToText | com.wowza.wms.plugin.captions.ModuleWhisperCaptions |
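In XML terms, the module definition above corresponds to a <Module> entry in the <Modules> list of your live application's Application.xml, roughly like this sketch:

```xml
<Modules>
    <!-- ...existing module definitions... -->
    <Module>
        <Name>whisperSpeechToText</Name>
        <Description>WhisperSpeechToText</Description>
        <Class>com.wowza.wms.plugin.captions.ModuleWhisperCaptions</Class>
    </Module>
</Modules>
```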
Configure module properties
After enabling the module, you can adjust the default settings by adding the following Custom properties to your live application. See Configure properties for details.
Required properties
| Path | Name | Type | Value | Description |
|------|------|------|-------|-------------|
| /Root/Application | whisperCaptionsEnabled | Boolean | true | If the whisperSpeechToText module is configured, set this property to enable it. The default value is false. |
| /Root/Application | whisperSocketHost | String | localhost | Specify the hostname or IP address where the Whisper service is hosted. |
| /Root/Application | whisperSocketPort | String | 3000 | Specify the network port on which the Whisper server is actively listening for incoming connections. |
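If you prefer editing Application.xml directly instead of using Wowza Streaming Engine Manager, the required properties map to <Property> entries under the application's <Properties> container, as in this sketch (values match the defaults in the table above):

```xml
<Properties>
    <Property>
        <Name>whisperCaptionsEnabled</Name>
        <Value>true</Value>
        <Type>Boolean</Type>
    </Property>
    <Property>
        <Name>whisperSocketHost</Name>
        <Value>localhost</Value>
        <Type>String</Type>
    </Property>
    <Property>
        <Name>whisperSocketPort</Name>
        <Value>3000</Value>
        <Type>String</Type>
    </Property>
</Properties>
```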
Optional properties
Path | Name | Type | Value | Description |
/Root/Application | captionHandlerDebug | Boolean | true | Enables extra debug logging for troubleshooting. |
/Root/Application | captionHandlerStreamDelay | String | 10000 | Defines the delay between the source stream and output stream in milliseconds. The default value is 30000 (or 30 seconds). |
Configure captioning properties
The whisperSpeechToText module enables WebVTT captions and defaults to the detected language. If you plan to use embedded captions, such as CEA-608/708, you have to disable the captionLiveIngestLanguages closed-captioning property.
- From the Properties tab of your Wowza Streaming Engine live application, click Closed Captions.
- Click Edit.
- Disable the captionLiveIngestLanguages property.
- Click Save.
- Restart your application.
- See Configure closed captioning for Wowza Streaming Engine live streams for more information.
Set up a Whisper server
If you're not using the Docker workflow to preview this module, you must independently set up a Whisper server to process audio data and return captions. We provide the whisper_streaming GitHub repository, which includes a Docker container for running a standalone Whisper service and builds on the open-source whisper_streaming project. To run the Whisper server, see the following sections.
Without GPU support
To run a Whisper server without GPU support, follow these steps.
- Install Docker Desktop, which includes the Docker Engine and the Docker Compose plugin.
- Make sure Docker Desktop and Docker Engine are running.
- Clone the whisper_streaming repo:
git clone git@github.com:WowzaMediaSystems/whisper_streaming.git
- Change to the whisper_streaming repo directory:
cd whisper_streaming
- Ensure that the image for the whisper_server service in the local docker-compose.yaml file is set to wowza/whisper_streaming:latest:
name: Whisper Streaming
services:
  whisper_server:
    hostname: whisper.server
    image: wowza/whisper_streaming:latest
- From your local whisper_streaming repo, run:
docker compose up
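Once the container is up, the Whisper service should be reachable on the port your application is configured to use (3000 by default, per the whisperSocketPort property). One quick way to probe connectivity is a short Python sketch; the host and port here are assumptions based on those defaults:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Attempt a TCP connection; True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the default Whisper socket used by the module properties
print(port_open("localhost", 3000))
```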
With GPU support (optional)
To run a Whisper server and enable GPU support, follow these steps.
- Install Docker Desktop, which includes the Docker Engine and the Docker Compose plugin.
- Make sure Docker Desktop and Docker Engine are running.
- Clone the whisper_streaming repo:
git clone git@github.com:WowzaMediaSystems/whisper_streaming.git
- Change to the whisper_streaming repo directory:
cd whisper_streaming
- In the local docker-compose.yaml file, uncomment and update the following lines for the whisper_server service. Ensure that the image for the whisper_server service is set to wowza/whisper_streaming:latest-gpu:
name: Whisper Streaming
services:
  whisper_server:
    hostname: whisper.server
    image: wowza/whisper_streaming:latest-gpu
    build:
      context: .
      # dockerfile: Dockerfile.jetson
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              count: 1
Notes:
- The wowza/whisper_streaming:latest-gpu image on Docker Hub is preconfigured and optimized for GPU acceleration. It includes all necessary settings and dependencies to enable GPU support out of the box.
- If you're not using the preconfigured wowza/whisper_streaming:latest-gpu image and are instead building the image locally, make sure to update this Dockerfile before building. You'll need to uncomment or modify the following lines to enable GPU support so the image is configured correctly for your setup.
# Install these for GPU, increases image size by ~5GB
RUN pip install torch
RUN pip install "triton>=2.0.0; platform_machine=='x86_64' and (sys_platform=='linux' or sys_platform=='linux2')"
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
RUN dpkg -i cuda-keyring_1.1-1_all.deb
RUN apt update && apt install cudnn9-cuda-12 -y
- In the local docker-compose.yaml file, uncomment and update the following environment variables for the whisper_server service:
- MODEL=large-v3-turbo
- USE_GPU=true
- FP16=true
Notes:
- We recommend using the large-v3-turbo Whisper model when enabling GPU processing.
- The USE_GPU boolean enables GPU acceleration for faster inference.
- The FP16 variable enables 16-bit floating-point precision for faster processing; it applies only on GPUs.
- From your local whisper_streaming repo, run:
docker compose up
- Verify GPU usage with the following command. You should see the Python process consuming GPU memory and compute resources:
watch -n 1 nvidia-smi
Test playback
Use the steps in this section to publish your source stream to Wowza Streaming Engine and to verify that the module is working as expected.
- Start a stream and send it to your Wowza Streaming Engine server using the following server/port and stream key. For more about publishing live streams, see Connect a live source to Wowza Streaming Engine.
rtmp://[server-ip-address]:1935/[application-name]/myStream
- Check the Incoming Streams page for your live stream, where the output looks similar to this:
Note: The [stream-name]_160p, [stream-name]_360p, and [stream-name]_source renditions include WebVTT captions. They're transcoded versions of the [stream-name]_delayed stream.
- Go to our Wowza Test Player to test playback with the automatically generated WebVTT captions using the following URL:
http://[server-ip-address]:[port]/[application-name]/myStream_delayed/playlist.m3u8
If using a SMIL file:
http://[server-ip-address]:[port]/[application-name]/smil:myStream.smil/playlist.m3u8