Load balance NVIDIA accelerated transcoding across GPUs with the Wowza Streaming Engine Java API

This article describes how to use the ITranscoderVideoLoadBalancer interface in the Wowza Streaming Engine Transcoder to balance NVIDIA accelerated transcoding across multiple GPUs. The load balancing is done at the server level.

Notes:
  • Wowza Streaming Engine 4.8.0 has a known issue with GPU performance when using NVIDIA hardware-accelerated decoding. For more information, see Wowza Streaming Engine 4.8.0 Release Notes. To address this issue, use Wowza Streaming Engine 4.8.5 or later.
  • Wowza Streaming Engine™ 4.6.0 or later is required.

The ITranscoderVideoLoadBalancer interface is invoked during transcoder execution:

public interface ITranscoderVideoLoadBalancer
{
	public abstract void init(IServer server, TranscoderContextServer transcoderContextServer);
	public abstract void onHardwareInspection(TranscoderContextServer transcoderContextServer);
	public abstract void onTranscoderSessionCreate(LiveStreamTranscoder liveStreamTranscoder);
	public abstract void onTranscoderSessionInit(LiveStreamTranscoder liveStreamTranscoder);
	public abstract void onTranscoderSessionDestroy(LiveStreamTranscoder liveStreamTranscoder);
	public abstract void onTranscoderSessionLoadBalance(LiveStreamTranscoder liveStreamTranscoder);
}

Where:

  • init is invoked when the server first starts.
  • onHardwareInspection is invoked when the transcoder is started and the server hardware is inspected for accelerated hardware resources.
  • onTranscoderSessionCreate is invoked when the transcoder session is created.
  • onTranscoderSessionInit is invoked after the transcoder session is fully initialized and the transcoder template has been read.
  • onTranscoderSessionDestroy is invoked when the transcoder session is destroyed.
  • onTranscoderSessionLoadBalance is invoked when the transcoder session is in the state just before the decoder, scaler, and encoder are initialized.

Configure the ITranscoderVideoLoadBalancer interface

To use the ITranscoderVideoLoadBalancer interface, you must do the following:

  1. Create a custom class that extends TranscoderVideoLoadBalancerBase (which implements the ITranscoderVideoLoadBalancer interface described above) and overrides the methods needed to do the load balancing. For more information, see Example class - TranscoderVideoLoadBalancerCUDASimple.
     
  2. Add the following server-level property, which points to your custom implementation, to the [install-dir]/conf/Server.xml file:
    <Property>
    	<Name>transcoderVideoLoadBalancerClass</Name>
    	<Value>[custom-class-path]</Value>
    </Property>

    Where the [custom-class-path] is the path pointing to your custom implementation. For example, if you're using the built-in TranscoderVideoLoadBalancerCUDASimple example class, below, you'd add the following:
    <Property>
    	<Name>transcoderVideoLoadBalancerClass</Name>
    	<Value>com.wowza.wms.transcoder.model.TranscoderVideoLoadBalancerCUDASimple</Value>
    </Property>
  3. (Optional) If you're load balancing against GPUs with unequal capabilities, you can use the transcoderVideoLoadBalancerCUDASimpleGPUWeights property to specify weight factors for each card based on their capabilities relative to the highest-performing card. See Load balancing across unequal GPUs.

Example class - TranscoderVideoLoadBalancerCUDASimple

The TranscoderVideoLoadBalancerCUDASimple class is built into Wowza Streaming Engine (4.5.0.01 and later). It can be used without any additional coding to do fairly simplistic load balancing of complete transcoder sessions across multiple GPUs. It load balances complete sessions because doing so is required if you want to do accelerated scaling (decoding, scaling, and encoding must be done on the same GPU so that frame data can stay on the device).

import java.util.*;

import com.wowza.util.*;
import com.wowza.wms.application.*;
import com.wowza.wms.logging.*;
import com.wowza.wms.media.model.*;
import com.wowza.wms.server.*;

public class TranscoderVideoLoadBalancerCUDASimple extends TranscoderVideoLoadBalancerBase
{
	private static final Class<TranscoderVideoLoadBalancerCUDASimple> CLASS = TranscoderVideoLoadBalancerCUDASimple.class;
	private static final String CLASSNAME = "TranscoderVideoLoadBalancerCUDASimple";

	public static final int DEFAULT_GPU_WEIGHT_SCALE = 1;
	public static final int DEFAULT_WEIGHT_FACTOR_ENCODE = 5;
	public static final int DEFAULT_WEIGHT_FACTOR_DECODE = 1;
	public static final int DEFAULT_WEIGHT_FACTOR_SCALE = 1;

	public static final int LOAD_MAG = 1000;

	public static final String PROPNAME_TRANSCODER_SESSION = "TranscoderVideoLoadBalancerCUDASimpleSessionInfo";

	class SessionInfo
	{
		private int gpuid = 0;
		private long load = 0;

		public SessionInfo(int gpuid, long load)
		{
			this.gpuid = gpuid;
			this.load = load;
		}
	}

	class GPUInfo
	{
		private int gpuid = 0;
		private long currentLoad = 0;
		private int weight = 0;

		private int getWeight()
		{
			return this.weight;
		}

		private long getUnWeightedLoad()
		{
			return currentLoad;
		}

		private long getWeightedLoad()
		{
			long load = 0;
			if (weight > 0)
				load = (currentLoad*gpuWeightScale)/weight;
			return load;
		}
	}

	private Object lock = new Object();
	private TranscoderContextServer transcoderContextServer = null;
	private boolean available = false;
	private int countGPU = 0;
	private int gpuWeightScale = DEFAULT_GPU_WEIGHT_SCALE;
	private int[] gpuWeights = null;
	private int weightFactorEncode = DEFAULT_WEIGHT_FACTOR_ENCODE;
	private int weightFactorDecode = DEFAULT_WEIGHT_FACTOR_DECODE;
	private int weightFactorScale = DEFAULT_WEIGHT_FACTOR_SCALE;
	private GPUInfo[] gpuInfos = null;

	@Override
	public void init(IServer server, TranscoderContextServer transcoderContextServer)
	{
		this.transcoderContextServer = transcoderContextServer;

		WMSProperties props = server.getProperties();

		this.weightFactorEncode = props.getPropertyInt("transcoderVideoLoadBalancerCUDASimpleWeightFactorEncode", this.weightFactorEncode);
		this.weightFactorDecode = props.getPropertyInt("transcoderVideoLoadBalancerCUDASimpleWeightFactorDecode", this.weightFactorDecode);
		this.weightFactorScale = props.getPropertyInt("transcoderVideoLoadBalancerCUDASimpleWeightFactorScale", this.weightFactorScale);

		String weightsStr = props.getPropertyStr("transcoderVideoLoadBalancerCUDASimpleGPUWeights", null);
		if (weightsStr != null)
		{
			String[] values = weightsStr.split(",");
			int maxWeight = 0;
			this.gpuWeights = new int[values.length];
			for(int i=0;i<values.length;i++)
			{
				String value = values[i].trim();
				if (value.length() <= 0)
				{
					this.gpuWeights[i] = -1;
					continue;
				}

				int weight = -1;
				try
				{
					weight = Integer.parseInt(value);
					if (weight < 0)
						weight = 0;
				}
				catch(Exception e)
				{
					// Not a valid integer: weight stays at -1 and is replaced with the highest weight below.
				}

				this.gpuWeights[i] = weight;
				if (weight > maxWeight)
					maxWeight = weight;
			}

			this.gpuWeightScale = maxWeight;
			for(int i=0;i<this.gpuWeights.length;i++)
			{
				if (this.gpuWeights[i] < 0)
					this.gpuWeights[i] = this.gpuWeightScale;
			}
		}

		WMSLoggerFactory.getLogger(CLASS).info(CLASSNAME+".init: weightFactorEncode:"+weightFactorEncode+" weightFactorDecode:"+weightFactorDecode+" weightFactorScale:"+weightFactorScale);
	}

	@Override
	public void onTranscoderSessionCreate(LiveStreamTranscoder liveStreamTranscoder)
	{
	}

	@Override
	public void onTranscoderSessionInit(LiveStreamTranscoder liveStreamTranscoder)
	{
	}

	@Override
	public void onTranscoderSessionDestroy(LiveStreamTranscoder liveStreamTranscoder)
	{
		if (this.countGPU > 1)
		{
			WMSProperties props = liveStreamTranscoder.getProperties();
			Object sessionInfoObj = props.get(PROPNAME_TRANSCODER_SESSION);
			if (sessionInfoObj != null && sessionInfoObj instanceof SessionInfo)
			{
				SessionInfo sessionInfo = (SessionInfo)sessionInfoObj;

				if (sessionInfo.gpuid < gpuInfos.length)
				{
					// Capture the load before clearing it so the log line reports the amount actually released.
					long releasedLoad = 0;
					synchronized(this.lock)
					{
						releasedLoad = sessionInfo.load;
						gpuInfos[sessionInfo.gpuid].currentLoad -= releasedLoad;
						sessionInfo.load = 0;
					}

					WMSLoggerFactory.getLogger(CLASS).info(CLASSNAME+".onTranscoderSessionDestroy["+liveStreamTranscoder.getContextStr()+"]: Removing GPU session: gpuid:"+sessionInfo.gpuid+" load:"+releasedLoad);
				}
			}
		}
	}

	@Override
	public void onHardwareInspection(TranscoderContextServer transcoderContextServer)
	{
		//{"infoCUDA":{"availabe":true,"availableFlags":65651,"countGPU":1,"driverVersion":368.81,"cudaVersion":8000,"isCUDAOldH264WindowsAvailable":false,"gpuInfo":[{"name":"GeForce GTX 960M","versionMajor":5,"versionMinor":0,"clockRate":1097500,"multiprocessorCount":5,"totalMemory":2147483648,"coreCount":640,"isCUDANVCUVIDAvailable":true,"isCUDAH264EncodeAvailable":true,"isCUDAH265EncodeAvailable":false,"getCUDANVENCVersion":5}]},"infoQuickSync":{"availabe":true,"availableFlags":537,"versionMajor":1,"versionMinor":19,"isQuickSyncH264EncodeAvailable":true,"isQuickSyncH265EncodeAvailable":true,"isQuickSyncVP8EncodeAvailable":false,"isQuickSyncVP9EncodeAvailable":false,"isQuickSyncH264DecodeAvailable":true,"isQuickSyncH265DecodeAvailable":false,"isQuickSyncMP2DecodeAvailable":true,"isQuickSyncVP8DecodeAvailable":false,"isQuickSyncVP9DecodeAvailable":false},"infoVAAPI":{"available":false},"infoX264":{"available":false},"infoX265":{"available":false}}

		boolean available = false;
		int countGPU = 0;

		String jsonStr = transcoderContextServer.getHardwareInfoJSON();
		if (jsonStr != null)
		{
			try
			{
				JSON jsonData = new JSON(jsonStr);

				if (jsonData != null)
				{
					Map<String, Object> entries = jsonData.getEntrys();

					Map<String, Object> infoCUDA = (Map<String, Object>)entries.get("infoCUDA");
					if (infoCUDA != null)
					{

						Object availableObj = infoCUDA.get("availabe");
						if (availableObj != null && availableObj instanceof Boolean)
						{
							available = ((Boolean)availableObj).booleanValue();
						}

						if (available)
						{
							Object countGPUObj = infoCUDA.get("countGPU");
							if (countGPUObj != null && countGPUObj instanceof Integer)
							{
								countGPU = ((Integer)countGPUObj).intValue();
							}
						}
					}
				}
			}
			catch(Exception e)
			{
				WMSLoggerFactory.getLogger(CLASS).info(CLASSNAME+".onHardwareInspection: Parsing JSON: ", e);
			}
		}

		WMSLoggerFactory.getLogger(CLASS).info(CLASSNAME+".onHardwareInspection: CUDA available:"+available+" countGPU:"+countGPU);

		synchronized(lock)
		{
			this.available = available;
			this.countGPU = countGPU;

			if (this.countGPU > 1)
			{
				this.gpuInfos = new GPUInfo[this.countGPU];
				for(int i=0;i<this.gpuInfos.length;i++)
				{
					this.gpuInfos[i] = new GPUInfo();

					this.gpuInfos[i].gpuid = i;

					if (this.gpuWeights != null && i < this.gpuWeights.length)
						this.gpuInfos[i].weight = this.gpuWeights[i];
					else
						this.gpuInfos[i].weight = gpuWeightScale;
				}
			}
		}
	}

	@Override
	public void onTranscoderSessionLoadBalance(LiveStreamTranscoder liveStreamTranscoder)
	{
		try
		{
			while(true)
			{
				if (this.gpuInfos == null)
					break;

				TranscoderStream transcoderStream = liveStreamTranscoder.getTranscodingStream();
				if (transcoderStream == null)
					break;

				TranscoderSession transcoderSession = liveStreamTranscoder.getTranscodingSession();
				if (transcoderSession == null)
					break;

				TranscoderSessionVideo transcoderSessionVideo = transcoderSession.getSessionVideo();
				if (transcoderSessionVideo == null)
					break;

				// transcoderSessionVideo is non-null here; getCodecInfo() may still return null.
				MediaCodecInfoVideo codecInfoVideo = transcoderSessionVideo.getCodecInfo();

				long loadDecode = 0;
				long loadScale = 0;
				long loadEncode = 0;

				boolean isScalerCUDA = false;

				TranscoderStreamSourceVideo transcoderStreamSourceVideo = null;
				TranscoderStreamScaler transcoderStreamScaler = null;

				TranscoderStreamSource transcoderStreamSource = transcoderStream.getSource();
				if (transcoderStreamSource != null)
				{
					transcoderStreamSourceVideo = transcoderStreamSource.getVideo();
					if (transcoderStreamSourceVideo != null && codecInfoVideo != null && (transcoderStreamSourceVideo.isImplementationNVCUVID() || transcoderStreamSourceVideo.isImplementationCUDA()))
					{
						loadDecode = codecInfoVideo.getFrameWidth() * codecInfoVideo.getFrameHeight();
					}
					else
						transcoderStreamSourceVideo = null;
				}

				transcoderStreamScaler = transcoderStream.getScaler();
				if (transcoderStreamScaler != null)
				{
					isScalerCUDA = transcoderStreamScaler.isImplementationCUDA();
				}

				List<TranscoderStreamDestination> destinations = transcoderStream.getDestinations();
				if (destinations == null)
					break;

				for(TranscoderStreamDestination destination : destinations)
				{
					if (!destination.isEnable())
						continue;

					TranscoderStreamDestinationVideo destinationVideo = destination.getVideo();

					if (destinationVideo == null)
						continue;

					if (destinationVideo.isPassThrough() || destinationVideo.isDisable())
						continue;

					TranscoderVideoFrameSizeHolder frameSizeHolder = destinationVideo.getFrameSizeHolder();
					if (frameSizeHolder == null)
						continue;

					if (isScalerCUDA)
						loadScale += frameSizeHolder.getActualWidth() * frameSizeHolder.getActualHeight();

					if (destinationVideo.isImplementationNVENC() || destinationVideo.isImplementationCUDA())
						loadEncode += frameSizeHolder.getActualWidth() * frameSizeHolder.getActualHeight();
				}

				long totalLoad = (loadDecode*weightFactorDecode) + (loadScale*weightFactorScale) + (loadEncode*weightFactorEncode);
				if (totalLoad <= 0)
					break;

				totalLoad /= LOAD_MAG;

				if (totalLoad <= 0)
					totalLoad = 1;

				int gpuid = -1;

				synchronized(lock)
				{
					long leastLoad = Long.MAX_VALUE;

					for(int i=0;i<gpuInfos.length;i++)
					{
						if (gpuInfos[i].getWeightedLoad() < leastLoad)
						{
							leastLoad = gpuInfos[i].getWeightedLoad();
							gpuid = i;
						}
					}

					if (gpuid >= 0)
						gpuInfos[gpuid].currentLoad += totalLoad;
				}

				WMSLoggerFactory.getLogger(CLASS).info(CLASSNAME+".onTranscoderSessionLoadBalance["+liveStreamTranscoder.getContextStr()+"]: gpuid:"+gpuid+" load:"+totalLoad+" [decode:"+loadDecode+" scale:"+loadScale+" encode:"+loadEncode+"]");

				if (gpuid >= 0)
				{
					liveStreamTranscoder.getProperties().put(PROPNAME_TRANSCODER_SESSION, new SessionInfo(gpuid, totalLoad));

					if (transcoderStreamSourceVideo != null)
						transcoderStreamSourceVideo.setGPUID(gpuid);

					if (transcoderStreamScaler != null && isScalerCUDA)
						transcoderStreamScaler.setGPUID(gpuid);

					for(TranscoderStreamDestination destination : destinations)
					{
						if (!destination.isEnable())
							continue;

						TranscoderStreamDestinationVideo destinationVideo = destination.getVideo();

						if (destinationVideo == null)
							continue;

						if (destinationVideo.isPassThrough() || destinationVideo.isDisable())
							continue;

						if (destinationVideo.isImplementationNVENC() || destinationVideo.isImplementationCUDA())
							destinationVideo.setGPUID(gpuid);
					}
				}
				break;
			}
		}
		catch(Exception e)
		{
			WMSLoggerFactory.getLogger(CLASS).info(CLASSNAME+".onTranscoderSessionLoadBalance: ", e);
		}
	}
}
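
To make the load estimate in onTranscoderSessionLoadBalance easier to follow, the arithmetic can be sketched in plain Java without any Wowza classes. The SessionLoadSketch class, the Rendition type, and the estimateLoad method are hypothetical names used only for illustration. The sketch also assumes a single flag for whether encoding runs on the GPU, whereas the class above checks each destination individually.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: mirrors the pixel-count arithmetic of the class above.
final class SessionLoadSketch
{
	// Default weight factors and divisor copied from the class above.
	static final int WEIGHT_FACTOR_ENCODE = 5;
	static final int WEIGHT_FACTOR_DECODE = 1;
	static final int WEIGHT_FACTOR_SCALE = 1;
	static final int LOAD_MAG = 1000;

	// Hypothetical stand-in for one enabled, non-passthrough video destination.
	static final class Rendition
	{
		final int width;
		final int height;

		Rendition(int width, int height)
		{
			this.width = width;
			this.height = height;
		}
	}

	// Returns the estimated session load, or 0 if nothing runs on the GPU.
	static long estimateLoad(int srcWidth, int srcHeight, boolean gpuDecode, boolean gpuScale, boolean gpuEncode, List<Rendition> renditions)
	{
		long loadDecode = gpuDecode ? (long)srcWidth * srcHeight : 0;
		long loadScale = 0;
		long loadEncode = 0;

		for (Rendition rendition : renditions)
		{
			long pixels = (long)rendition.width * rendition.height;
			if (gpuScale)
				loadScale += pixels;
			if (gpuEncode)
				loadEncode += pixels;
		}

		long totalLoad = (loadDecode*WEIGHT_FACTOR_DECODE) + (loadScale*WEIGHT_FACTOR_SCALE) + (loadEncode*WEIGHT_FACTOR_ENCODE);
		if (totalLoad <= 0)
			return 0;

		// Scale down by LOAD_MAG, but never report a GPU session as zero load.
		totalLoad /= LOAD_MAG;
		return totalLoad <= 0 ? 1 : totalLoad;
	}

	public static void main(String[] args)
	{
		List<Rendition> renditions = Arrays.asList(new Rendition(1280, 720), new Rendition(640, 360));

		// 1080p source decoded, scaled, and encoded on the GPU: prints 8985
		System.out.println(estimateLoad(1920, 1080, true, true, true, renditions));
	}
}
```

With the default factors, encoding dominates the estimate: the two renditions contribute 5,760,000 of the 8,985,600 weighted pixels before the division by LOAD_MAG.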

Load balancing across unequal GPUs

The built-in TranscoderVideoLoadBalancerCUDASimple class can load balance across NVIDIA CUDA-based cards that aren't equally capable. To do so, specify a weighting factor for each card with the server-level property transcoderVideoLoadBalancerCUDASimpleGPUWeights, which takes a comma-separated list of weighting factors, one per CUDA card.

We recommend a factor of 100 for the highest-performing card and lower factors for lower-capacity cards, each based on that card's capacity relative to the highest-performing card. The CUDA card order is logged in the server logs on Wowza Streaming Engine startup, and the list of weighting factors must be in the same order.

For example, if you have an M5000 card (index 0) and an M2000 card (index 1) in your server, the transcoderVideoLoadBalancerCUDASimpleGPUWeights property might be set as follows:

<Property>
	<Name>transcoderVideoLoadBalancerCUDASimpleGPUWeights</Name>
	<Value>100,66</Value>
</Property>

This indicates that the M2000 card has 66 percent of the capacity of the M5000 card.
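
To see how these weights influence GPU selection, the following plain-Java sketch reproduces the weight parsing and the weighted-load comparison from the example class, without any Wowza dependencies. The class and method names (WeightedGpuSelectionSketch, parseWeights, weightedLoad, assign) are hypothetical, and the ten equal sessions of load 1000 are an assumption chosen only to make the distribution easy to read.

```java
import java.util.Arrays;

// Illustrative only: a simplified model of the weighted least-load selection.
final class WeightedGpuSelectionSketch
{
	// Parses a comma-separated weight list. Blank or non-numeric entries
	// default to the highest weight; negative values are clamped to 0.
	static int[] parseWeights(String weightsStr)
	{
		String[] values = weightsStr.split(",");
		int[] weights = new int[values.length];
		int maxWeight = 0;

		for (int i = 0; i < values.length; i++)
		{
			int weight = -1;
			try
			{
				weight = Math.max(0, Integer.parseInt(values[i].trim()));
			}
			catch (NumberFormatException e)
			{
				// Leave at -1 so the entry is replaced with maxWeight below.
			}
			weights[i] = weight;
			maxWeight = Math.max(maxWeight, weight);
		}

		for (int i = 0; i < weights.length; i++)
		{
			if (weights[i] < 0)
				weights[i] = maxWeight;
		}
		return weights;
	}

	// Same formula as getWeightedLoad(): a lower weight inflates the apparent load.
	static long weightedLoad(long currentLoad, int weight, int weightScale)
	{
		return weight > 0 ? (currentLoad * weightScale) / weight : 0;
	}

	// Assigns each session to the GPU with the smallest weighted load and
	// returns the chosen GPU index for each session (ties go to the lowest index).
	static int[] assign(int[] weights, long[] sessionLoads)
	{
		int weightScale = Arrays.stream(weights).max().orElse(1);
		long[] currentLoad = new long[weights.length];
		int[] chosen = new int[sessionLoads.length];

		for (int s = 0; s < sessionLoads.length; s++)
		{
			int gpuid = -1;
			long leastLoad = Long.MAX_VALUE;
			for (int i = 0; i < weights.length; i++)
			{
				long load = weightedLoad(currentLoad[i], weights[i], weightScale);
				if (load < leastLoad)
				{
					leastLoad = load;
					gpuid = i;
				}
			}
			if (gpuid >= 0)
				currentLoad[gpuid] += sessionLoads[s];
			chosen[s] = gpuid;
		}
		return chosen;
	}

	public static void main(String[] args)
	{
		long[] sessionLoads = new long[10];
		Arrays.fill(sessionLoads, 1000L);

		// prints [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
		System.out.println(Arrays.toString(assign(parseWeights("100,66"), sessionLoads)));
	}
}
```

In this run, six of the ten sessions land on GPU 0 and four on GPU 1, so the lower-weighted card carries about two-thirds of the load of the higher-weighted one, which is consistent with the 100,66 weighting.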