Video Formats for Live Streaming
April 8, 2020
Understanding How to Acquire and Deliver Live Streams
Last month, I covered video formats for use with video on demand (VOD) assets. Video formats are the containers used to combine and store compressed audio and video streams (sometimes referred to as elementary streams, especially in the case of standards-based formats such as MPEG-2 and MPEG-4) for transmission to the content delivery network (CDN) for further distribution to end users.
What are the steps in choosing a video format for live streaming? I’d recommend accounting for three inflection points.
The first inflection point worth considering is how much latency is acceptable.
The longer the latency, the less chance viewers have to synchronize their feedback (through polling or chat) with the visuals on screen. A long delay is perfectly suitable for traditional broadcast-style delivery, where real-time viewer feedback is either unimportant or unnecessary. Shorter latency, by contrast, allows you to synchronize feedback, but it also requires the use of a CDN even for modest distribution.
We’ll cover the options below, but the rule of thumb is that some live streaming protocols, based on Hypertext Transfer Protocol (HTTP) delivery, scale well without the need for specialized servers — but introduce longer latencies in the process. Other live streaming protocols, based on UDP, offer lower latencies, even to the point of allowing visual-based interactivity (e.g., video conferencing like FaceTime, Skype, or Zoom) but don’t scale well without a CDN infrastructure.
The second inflection point is which devices you expect you’ll be delivering your streams to. For instance, Apple iPhone and iPad devices natively only support HTTP delivery, while set-top boxes, older smart TVs, and most computers support both HTTP and RTP-based streaming delivery.
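To make the first two inflection points concrete, here is a minimal Python sketch of how they might combine into a first-pass choice. The `Audience` type, the `suggest_delivery` function, the returned labels, and the 3-second threshold are all assumptions made for illustration; none of them comes from a specification.

```python
from dataclasses import dataclass


@dataclass
class Audience:
    max_latency_s: float     # acceptable delay between camera and viewer
    http_only_devices: bool  # True if some viewers can only receive HTTP delivery


def suggest_delivery(audience: Audience, interactive_threshold_s: float = 3.0) -> str:
    """Return a rough delivery family; the threshold is an illustrative assumption."""
    if audience.max_latency_s >= interactive_threshold_s:
        # Longer delays are acceptable, so plain HTTP delivery scales most easily.
        return "http"
    if audience.http_only_devices:
        # Low latency is wanted, but some devices only accept HTTP.
        return "http-low-latency"
    # Low latency and no HTTP-only devices: UDP-based delivery is an option,
    # at the cost of needing a streaming server or CDN to scale.
    return "rtp-udp"
```

For example, `suggest_delivery(Audience(30.0, True))` returns `"http"`, while `suggest_delivery(Audience(0.5, False))` returns `"rtp-udp"`. A real decision would weigh far more than two inputs, so treat this as a way to organize the questions rather than an answer.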
Live Streaming Protocols
The final inflection point is understanding how the underlying protocols for live streaming via either RTP or HTTP affect format choices. While these protocols aren’t formats in and of themselves, it’s important to understand how they affect the choice of format.
Note that all true streaming protocols are based around the Real-time Transport Protocol (RTP), which itself is delivered via an internet protocol called UDP. The combination of RTP and UDP allows very low-latency encoding (ingest) and delivery (egress), but the tradeoff is that reaching an audience bigger than a handful of viewers will generally require a streaming server or CDN. Additionally, UDP is a bit aggressive on networks and often requires a specific port to be opened on routers, which is one reason that RTP-based encoding and delivery aren’t in use on many enterprise networks.
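RTP itself adds very little to each UDP datagram: a 12-byte fixed header carrying, among other fields, a sequence number and a timestamp. The Python sketch below packs and unpacks that fixed header as laid out in RFC 3550. The function names, the default payload type of 96 (a value from the dynamic range), and the simple loss counter that assumes in-order arrival are illustrative choices, not part of any particular product.

```python
import struct

RTP_VERSION = 2
FIXED_HEADER = "!BBHII"  # flags, marker+payload type, sequence, timestamp, SSRC


def pack_rtp_header(seq: int, timestamp: int, ssrc: int,
                    payload_type: int = 96, marker: bool = False) -> bytes:
    """Build the 12-byte fixed RTP header (no padding, extension, or CSRC list)."""
    byte0 = RTP_VERSION << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack(FIXED_HEADER, byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(packet: bytes) -> dict:
    """Read the fixed header fields back out of a packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack(FIXED_HEADER, packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def count_lost(sequences: list) -> int:
    """Count missing packets from gaps in 16-bit sequence numbers.

    Assumes packets arrived in order; real receivers also handle reordering.
    """
    lost = 0
    for prev, cur in zip(sequences, sequences[1:]):
        gap = (cur - prev) & 0xFFFF  # handles wraparound from 65535 to 0
        if gap > 1:
            lost += gap - 1
    return lost
```

Because UDP never retransmits, a receiver can only notice a missing packet through these sequence numbers, which is part of why RTP delivery stays fast.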
Rather than requiring a specific port on the router to be opened up, newer “streaming” protocols use the highly popular and universally open port 80, which allows Hypertext Transfer Protocol (HTTP) traffic to pass into an enterprise network. While HTTP was initially used only to deliver text and images to web browsers, today it is used extensively to deliver audio and video.
Based on the three inflection points above, let’s take a look at several protocols and how they relate to video container formats.
RTMP (Real-Time Messaging Protocol):
RTMP was created by Macromedia for its Flash product and the associated SWF (Shockwave Flash) web interactivity format for game and video delivery. While it had the standard benefits of RTP, such as timestamps (for synchronization) and sequence numbers (for packet loss and reordering detection), one of the main benefits of RTMP is that it can provide decently low-latency delivery via TCP rather than relying fully on UDP. TCP is a less-aggressive internet protocol that requires a small bit of latency so that the sender and receiver can confirm that content has been delivered (often referred to as a handshake).
RTMP is still in wide use today, years after Adobe deprecated Flash, and it has been advanced to handle content based on industry-standard video codecs such as AVC (also known as H.264 or MPEG-4 Part 10). Given the nature of its real-time delivery as a specialized protocol, however, RTMP is almost exclusively used on the ingest (encoding) side of the media equation.
For most streaming delivery, RTMP is used to send a single low-latency stream to an on-premises media server or cloud-based transcoder, where it is converted and repackaged for consumption as part of an HTTP delivery workflow. In these cases, the container format for delivery is specified upon repackaging.
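As one illustration of that first hop, the Python sketch below builds an FFmpeg command line that pushes a single AVC/AAC stream to an RTMP ingest point. The function name, the default bitrates, and the example URL are placeholders I made up; the FFmpeg options themselves (`-re`, `-c:v libx264`, `-c:a aac`, `-f flv`) are standard, and this assumes an FFmpeg build that includes libx264.

```python
from typing import List


def rtmp_push_command(source: str, ingest_url: str,
                      video_kbps: int = 3000, audio_kbps: int = 128) -> List[str]:
    """Return an FFmpeg argument list that sends one stream to an RTMP server."""
    return [
        "ffmpeg",
        "-re",                     # read input at its native rate, like a live source
        "-i", source,
        "-c:v", "libx264",         # AVC (H.264) video
        "-b:v", f"{video_kbps}k",
        "-c:a", "aac",             # AAC audio
        "-b:a", f"{audio_kbps}k",
        "-f", "flv",               # RTMP carries FLV-style audio and video messages
        ingest_url,
    ]


# Hypothetical ingest address; substitute the one your media server gives you.
command = rtmp_push_command("input.mp4", "rtmp://media.example.com/live/stream-key")
```

Passing `command` to `subprocess.run(command, check=True)` would start the push. Note that the container on this hop is dictated by RTMP; the delivery container is chosen later, when the server repackages the stream.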
Another low-latency technology that is growing in popularity is WebRTC, where RTC stands for Real-Time Communication. WebRTC uses the more aggressive UDP internet protocol to keep latencies at a bare minimum, but it’s designed to work natively in a web browser, eliminating the need for a plug-in architecture like Flash.
Because WebRTC is a very low-latency technology, it’s the one most likely to need a streaming server or CDN infrastructure to scale up, and a few CDNs are focusing on delivering global WebRTC at half-second latencies end to end. This makes it ideal for wagering on sports but also makes it a potential replacement for remote broadcast links.
WebRTC does not use a container format at all. Instead, it streams the encoded data directly from one peer to another using the connection between browsers.
Newer HTTP-based video streaming protocols, such as Apple HTTP Live Streaming (HLS) or MPEG Dynamic Adaptive Streaming over HTTP (DASH), deliver small portions of the elementary audio and video streams, which are played back on the end-user device in a pre-defined order that’s specified by a “traffic cop” text document known as a manifest file.
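To show what such a manifest looks like, here is a minimal Python sketch that writes an HLS media playlist for a live stream. The function name and the segment file names are hypothetical, and the sketch assumes fragmented MP4 segments with a separate initialization segment; the playlist tags themselves are standard HLS tags.

```python
import math
from typing import List


def live_media_playlist(first_sequence: int, segment_names: List[str],
                        segment_duration_s: float,
                        init_segment: str = "init.mp4") -> str:
    """Build a live HLS media playlist listing the currently available segments."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:7",
        # Target duration must be an integer no smaller than any segment duration.
        f"#EXT-X-TARGETDURATION:{math.ceil(segment_duration_s)}",
        # Sequence number of the first segment listed; it rises as old segments drop off.
        f"#EXT-X-MEDIA-SEQUENCE:{first_sequence}",
        # Initialization segment for fragmented MP4 media (an assumption of this sketch).
        f'#EXT-X-MAP:URI="{init_segment}"',
    ]
    for name in segment_names:
        lines.append(f"#EXTINF:{segment_duration_s:.3f},")
        lines.append(name)
    # No #EXT-X-ENDLIST tag: its absence tells the player the stream is still live.
    return "\n".join(lines) + "\n"


playlist = live_media_playlist(100, ["seg100.m4s", "seg101.m4s", "seg102.m4s"], 6.0)
```

The player re-fetches this text file over plain HTTP every few seconds, then requests each listed segment in order, which is why an ordinary web server or CDN is enough to deliver it.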
While there were a number of differences in HTTP-based delivery approaches a few years ago (including some that used the MPEG-2 Transport Stream, or MPEG-TS, format), today HTTP approaches largely rely on fragmenting MP4 video files into segments, whether for live or on-demand delivery. These MP4 files often carry AVC video and AAC audio. The MP4 format (MPEG-4 Part 14) is built on the ISO Base Media File Format (ISO BMFF, MPEG-4 Part 12), and the two names are often used interchangeably.
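Under the hood, an ISO BMFF file or fragment is just a sequence of “boxes,” each starting with a 4-byte big-endian size and a 4-byte type code. The Python sketch below walks the top-level boxes of a byte string. The function names are my own, and `make_box` exists only to fabricate sample data for the example; a fragmented MP4 media segment typically shows up as a `moof` box followed by an `mdat` box.

```python
import struct
from typing import Iterator, Tuple


def iter_top_level_boxes(data: bytes) -> Iterator[Tuple[str, int]]:
    """Yield (type, size) for each top-level ISO BMFF box in data."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header_len = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header_len = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the data.
            size = len(data) - offset
        if size < header_len:
            break  # malformed box; stop rather than loop forever
        yield box_type.decode("ascii", errors="replace"), size
        offset += size


def make_box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Helper for the example only: wrap a payload in a box header."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


sample = make_box(b"styp", b"msdh") + make_box(b"moof") + make_box(b"mdat", b"\x00" * 10)
boxes = list(iter_top_level_boxes(sample))
```

Here `boxes` comes out as `[("styp", 12), ("moof", 8), ("mdat", 18)]`. A real `moof` box contains nested boxes with timing and sample information, which this sketch does not descend into.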
Segmenting MP4 files into hundreds or thousands of small files, sometimes also referred to as packaging, can be done independently for all audio and video tracks (including segmenting alternate language tracks separately so as to lower the overall bandwidth required to deliver to a given end user). Even though every approach will be delivered via a standard HTTP server, each disparate flavor of segmentation requires additional time to encode a portion of the original content to an MP4 file format (ISO BMFF) and then package the MP4 fragment into a particular HTTP streaming approach. The end result is a scalable solution that suffers from end-viewer latencies ranging from 6-30 seconds behind the actual encoder.
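A rough way to see where a range like that comes from is to multiply the segment duration by the number of whole segments a player buffers before it starts playback, then add any encoding and packaging time. The Python sketch below does exactly that. The function name and the default of three buffered segments are assumptions on my part (a common rule of thumb, not a requirement of any protocol).

```python
def estimated_latency_s(segment_duration_s: float,
                        buffered_segments: int = 3,
                        encode_and_package_s: float = 0.0) -> float:
    """Back-of-envelope latency for segmented HTTP delivery.

    Assumes the player waits for `buffered_segments` complete segments.
    """
    return segment_duration_s * buffered_segments + encode_and_package_s


short_segments = estimated_latency_s(2.0)   # 2-second segments
long_segments = estimated_latency_s(10.0)   # 10-second segments
```

Under those assumptions, 2-second segments work out to 6 seconds and 10-second segments to 30 seconds, before any encoding or network delay is added, which brackets the range cited above.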
Worse still, because these HTTP-based streaming formats read MP4 fragments (fMP4) but segment the fragments differently, each version of HTTP delivery requires its own storage solution. This may not seem to be a big consideration when it comes to live streaming, but delivery to multiple device types has the potential to add to the overall latency (see inflection point 2 at the outset of this article). In addition, most live streams are also saved for later on-demand viewing, so the storage considerations can be significant for long-format live streams such as sporting events and services at houses of worship.
Given the benefits of fMP4 and the downsides of disparate HTTP-based streaming approaches, the industry understood the need to work out a common packaging format so that files could be stored once but delivered to all devices. An effort undertaken by Apple, Microsoft, and the Moving Picture Experts Group (MPEG) resulted in a specification called the Common Media Application Format (CMAF).
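The storage argument is easy to put into numbers. The Python sketch below estimates how much space an archived live event takes when it is packaged once versus once per HTTP streaming flavor. The function name and the example figures (a 10 Mbps total for all renditions, a three-hour event) are hypothetical, and the calculation ignores container overhead.

```python
def archive_storage_gb(total_bitrate_mbps: float, duration_hours: float,
                       packaged_copies: int = 1) -> float:
    """Approximate storage in gigabytes (10^9 bytes), ignoring container overhead."""
    total_bits = total_bitrate_mbps * 1_000_000 * duration_hours * 3600
    return total_bits / 8 / 1e9 * packaged_copies


# Hypothetical three-hour sporting event with 10 Mbps across all renditions.
two_packagings = archive_storage_gb(10, 3, packaged_copies=2)
single_packaging = archive_storage_gb(10, 3, packaged_copies=1)
```

With these made-up figures, the event needs about 27 GB when stored separately for two packaging flavors and about 13.5 GB when stored once, and the gap widens with every additional flavor or hour of content.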
Wowza’s Traci Ruether has written about the benefits of CMAF and how it rationalizes the disparate packaging approaches into a single format. A significant byproduct of the common format is that, when paired with low-latency behaviors across the ecosystem, CMAF lowers overall latencies from the 30-second range noted above to around 3 seconds, bringing it much closer to RTMP and WebRTC while maintaining HTTP delivery.
Akamai’s Will Law wrote extensively about this benefit, noting that an ultra-low latency streaming approach using encoding into CMAF-compliant segments means that the delivery to end devices would require no additional segmentation and repackaging time.
So is CMAF the universal format for both live and on-demand streaming? If you’d asked me in mid-2019, I’d have said it had strong potential. But by November 2019, at the Streaming Media West event, Apple’s Roger Pantos was demonstrating the step-by-step concept of a low-latency version of HLS that, while based on fMP4 from a format standpoint and using a subsegment approach, is currently incompatible with low-latency CMAF. The benefit of Low-Latency HLS, according to Pantos, is latencies around 2 seconds, roughly a one-third improvement on the 3-second figure claimed for CMAF.
Industry feedback was pointed, given Apple’s initial advocacy for CMAF, and by early 2020, Apple had modified Low-Latency HLS to address one stumbling block (the requirement for HTTP/2 server push to deliver Low-Latency HLS), bringing it more in line with the CMAF approach.
We’ve briefly covered a number of areas where protocols and formats intersect, from device types to latencies. The common factor in all of these comes back to variations of the MP4 file format, more specifically fragmented MP4 and the ISO Base Media File Format. And, while the MP4 format will handle HEVC (also known as H.265), the current primary use for MP4 files is AVC video and AAC audio tracks.
Regardless of whether you choose to encode streams via RTMP or an HTTP encoder, the delivery format to almost every end user will be a version of the MP4 file format, segmented to meet the requirements of that end user’s device. For that purpose, CMAF shows the most promise as a standards-based approach, but pay attention to Low-Latency HLS in the next few months to see whether modifications will strengthen CMAF or set the industry on a course requiring support for two different formats.