AI in Video Production and Delivery
Technologies driven by artificial intelligence (AI) are appearing in many markets, including video streaming. AI is a general description of a program that can sense, reason, act or adapt, as shown in the figure below.
Though the lines are fuzzy between classifications, most AI advancements and applications are more accurately called “machine learning.” Machine learning is a more specific subset of AI, and describes any algorithm that improves over time by incorporating more data.
In turn, “deep learning” is a more sophisticated approach to machine learning. It incorporates the use of a multi-layered neural network: a digital network of artificial neurons, modeled after the human brain, that takes in large amounts of labeled data and uses it to categorize information, identify patterns, learn and adapt.
How AI, machine learning and deep learning fit together (Source: Prowess)
This article will explore AI, machine learning and deep learning. We’ll discuss their primary applications to video encoding and streaming delivery, as well as their other applications in video production.
Machine Learning in Encoding
Different types of streaming content need to be processed differently. But prior to 2015, most encoding shops used a fixed bitrate ladder for every video, irrespective of the content type or the nuances of individual pieces. For example, talking heads can look great at far lower data rates than a soccer match or ballet. Treating this content the same tends to result in one appearing poor-quality, while another unnecessarily consumes large amounts of data and bandwidth.
This all changed when Netflix released their per-title encoding schema in December 2015. At a high level, the schema involved encoding each source file to multiple resolutions and data rates, then using a video metric called PSNR (Peak Signal to Noise Ratio) to pick the best iteration for each point in the encoding ladder. But PSNR is an older, static algorithm, and is often criticized for being a poor predictor of how human observers would actually rate the videos—making it less than ideal for a per-title encoding schema for premium Netflix viewers.
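To make the metric concrete, here is a minimal sketch of how PSNR is computed for 8-bit video. The pixel values and the function name are made up for illustration. Real tools work on full frames rather than short lists. Note that the calculation is pure arithmetic on pixel differences, with no model of how viewers perceive them.

```python
import math

def psnr(reference, distorted, max_value=255):
    """Return PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(distorted):
        raise ValueError("frames must have the same number of pixels")
    # Mean squared error between the source and the encoded pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical luma samples from a source frame and its encoded version.
reference = [52, 55, 61, 66, 70, 61, 64, 73]
distorted = [54, 55, 60, 67, 69, 62, 64, 70]
print(f"PSNR: {psnr(reference, distorted):.2f} dB")
```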
Then, in June 2016, machine learning entered the picture, in the form of a new metric and method called Video Multimethod Assessment Fusion (VMAF). In VMAF, the results of subjective tests are fused with analytical metrics via a machine-learning algorithm. Put another way, the algorithm “learns” data values and patterns by analyzing “good quality” samples, identified and fed to it by human observers.
This allows VMAF to not only incorporate subjective evaluations into its scoring system, but also to improve over time. It also allows Netflix to tune the metric for different consumption methods (e.g., mobile versus big-screen viewing), by feeding different subjective results into the machine-learning algorithm.
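The fusion idea can be illustrated with a toy example. Netflix has described VMAF's fusion step as a support vector machine regressor. The sketch below swaps in plain linear regression trained by gradient descent, so it shows the concept rather than VMAF itself. The two features, the training scores and the function names are all hypothetical.

```python
def fit_linear(features, scores, lr=0.1, epochs=10000):
    """Fit weights and a bias that map objective features to subjective scores."""
    n_feat = len(features[0])
    weights = [0.0] * n_feat
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(features, scores):
            err = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Predict a subjective score from objective features."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# Hypothetical training set: (detail fidelity, motion) per clip, both 0..1,
# paired with the mean score that human viewers gave the clip (1..5).
features = [[0.2, 0.1], [0.4, 0.5], [0.6, 0.2], [0.8, 0.7], [0.9, 0.3], [1.0, 0.6]]
scores = [1.6, 2.3, 3.2, 3.8, 4.4, 4.6]

weights, bias = fit_linear(features, scores)
print(f"Predicted score: {predict(weights, bias, [0.7, 0.4]):.2f}")
```

Tuning the metric for a different viewing condition, such as mobile, would mean retraining with scores collected under that condition.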
Netflix has open-sourced VMAF, so it can be tuned for different content types and use cases. As an example, a security firm could input the results of subjective evaluations of security footage to create a VMAF tuned for that type of video. The Cartoon Network or a sports network could do the same for their own footage.
Machine Learning Helps Netflix, YouTube Encode Smarter
As deployed by Netflix, VMAF serves as a vital component of a scoring system—but it doesn’t set encoding parameters. Rather, it replaced PSNR as the mechanism for identifying the highest-quality streams from the dozens of test encodes produced for each source file. When you’re encoding a relatively limited group of files for ultra-high-volume viewing, this brute-force approach makes sense.
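As a rough sketch of that selection step, the code below picks the highest-scoring test encode that fits each bitrate in the ladder. The resolutions, bitrates and scores are invented, and the function name is hypothetical. Netflix's production logic is more involved than this.

```python
def build_ladder(test_encodes, target_bitrates_kbps):
    """For each target bitrate, keep the best-scoring encode that fits within it."""
    ladder = {}
    for target in target_bitrates_kbps:
        eligible = [e for e in test_encodes if e["kbps"] <= target]
        if eligible:
            ladder[target] = max(eligible, key=lambda e: e["score"])
    return ladder

# Hypothetical test encodes of one title, each scored with a quality metric.
test_encodes = [
    {"resolution": "1920x1080", "kbps": 4300, "score": 93.1},
    {"resolution": "1280x720", "kbps": 4300, "score": 90.4},
    {"resolution": "1920x1080", "kbps": 2350, "score": 84.0},
    {"resolution": "1280x720", "kbps": 2350, "score": 86.2},
    {"resolution": "1280x720", "kbps": 1050, "score": 74.5},
    {"resolution": "640x360", "kbps": 1050, "score": 77.9},
]

ladder = build_ladder(test_encodes, [4300, 2350, 1050])
for kbps, encode in ladder.items():
    print(kbps, encode["resolution"], encode["score"])
```

In this made-up data, the lower resolution wins at the lower bitrates, which is the kind of content-specific result a fixed ladder cannot capture.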
With 300 hours of video uploaded each minute, YouTube has a very different problem: how to get a reasonable-quality encode in a single try. To accomplish this, YouTube created a neural network that incorporated data from over 137,000 test encodes on 14,000 clips.
When encoding an uploaded clip, YouTube inputs data learned from a mezzanine transcode and low-resolution test encode of the clip into the neural network. The output is a single encoding parameter used to encode each clip to meet YouTube’s quality objectives. Netflix reportedly is developing a similar approach, called the Dynamic Optimizer, to deploy scene-based encoding for its videos.
Commercially Available Per-Title Technologies
What does all this mean for you? You probably don’t have the problems faced by Netflix and YouTube. But you do have your choice of desktop or cloud encoder, and can look for a solution with some form of per-title encoding.
To date, most commercially available per-title features have used static algorithms. For example, Capella Systems’ Source Adaptive Bitrate Ladder (SABL) is based on Constant Rate Factor (CRF) encoding. Available in the x264, x265 and VP9 encoders, among others, CRF is a rate-control mechanism that adjusts the data rate of the encoded video up and down to maintain constant quality.
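For readers who have not used CRF, this sketch builds an FFmpeg command line that requests CRF encoding with x264. It has nothing to do with how Capella implements SABL. The file names and the helper function are hypothetical. The -crf option and its 0 to 51 range are standard for FFmpeg's libx264 encoder.

```python
def crf_command(source, output, crf=23):
    """Build an FFmpeg argument list for a CRF encode with x264."""
    if not 0 <= crf <= 51:
        raise ValueError("x264 CRF values run from 0 (lossless) to 51 (worst)")
    return [
        "ffmpeg",
        "-i", source,        # input file
        "-c:v", "libx264",   # encode video with x264
        "-crf", str(crf),    # lower value means higher quality and bitrate
        "-preset", "medium",
        output,
    ]

# Hypothetical file names; print the command rather than run it.
print(" ".join(crf_command("talking_head.mp4", "talking_head_crf23.mp4")))
```

Encoding a talking head and a soccer match at the same CRF value would typically produce very different bitrates, which is what makes CRF useful as a gauge of content complexity.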
Though Capella’s SABL works very well, you can’t train it with external, subjective data. Other encoding shops have introduced per-title features, but there’s no evidence they incorporate formal machine learning.
But that may soon change. In an October 2017 blog post, compression guru Fabio Sonnati detailed a project he was working on with NTT Data, which develops and sells streaming encoders. According to Sonnati, more than 14,000 quality ratings were used to train the machine learning algorithm, which allows the user to select a target mean opinion score (MOS) for each encode. (MOS is the ranking used for subjective tests, usually on a scale from 1-5.)
Presumably, it won’t take long for features like this to appear in multiple cloud or even desktop encoders. In fact, in early 2018, EuclidIQ released a cloud encoding platform called Rithm.
According to the company’s website, “Rithm’s content adaptation models video quality on scores recorded by human subjects, so Rithm’s AI is based on human perception, not some engineer’s equations.” In other words, while not yet generally available, machine-learning-based encoding systems will soon be the norm.
Machine-Learning Applications in Streaming Technology
Streaming producers deploy adaptive bitrate video to deliver high-quality experiences across a variety of networks and devices; for example, it’s one of the key features deployed by users of the Wowza Streaming Engine™ software. Still, buffering continues to be an unfortunate fact of life for many viewers.
MIT’s Pensieve uses a neural network to decide which streams to retrieve to improve QoE
One machine-learning-based approach to this issue is from MIT’s Computer Science and Artificial Intelligence Laboratory. Called Pensieve, it uses a neural network to make data-driven decisions about which streams to retrieve to avoid buffering and other playback issues (as shown in the image above). Pensieve will replace the simple bitrate switching algorithms used by most players today, which should reduce buffering for all viewers.
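For contrast, here is a minimal sketch of the kind of simple, throughput-based switching rule that a learned approach like Pensieve aims to replace. The ladder values, the safety factor and the function name are assumptions for illustration. Real players also weigh buffer level and other signals.

```python
def choose_rendition(ladder_kbps, throughput_samples_kbps, safety=0.8):
    """Pick the highest bitrate that fits within a fraction of recent throughput."""
    if not throughput_samples_kbps:
        return min(ladder_kbps)  # no measurements yet, so start low
    estimate = sum(throughput_samples_kbps) / len(throughput_samples_kbps)
    budget = estimate * safety
    affordable = [rate for rate in ladder_kbps if rate <= budget]
    return max(affordable) if affordable else min(ladder_kbps)

# Hypothetical ladder and recent download speeds, in kbps.
ladder_kbps = [400, 1050, 2350, 4300]
print(choose_rendition(ladder_kbps, [3200, 2900, 3100]))
print(choose_rendition(ladder_kbps, [500, 450]))
```

A fixed rule like this reacts only to the last few measurements. Pensieve instead learns a policy from observed playback sessions.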
As with encoding, this isn’t a system that the local hardware store will develop to improve playback of their how-to videos. But there’s little doubt that features like this will become standard for online video platforms and content delivery networks at some point in the future.
Artificial Intelligence in Video Production
Outside of streaming, AI is poised to play an increasingly large role in all aspects of video production. An article entitled “AI and the Next Decade of Video Production,” on the website of Chicago video producer Richter Studios, provides an excellent overview.
One of the first roles identified in the article is movie-trailer creation, where the author describes IBM creating a trailer for the horror movie “Morgan” using machine learning. First, IBM trained the system by feeding it data from over 100 horror movies and their trailers. Then IBM fed “Morgan” into the system, which identified 10 “moments” totaling six minutes that were the best candidates for the trailer. Finally, an IBM filmmaker crafted these moments into the finished trailer.
The Richter article goes on to identify other ways AI will impact video production, including editing, scoring, scriptwriting, cinematography and even voice-over talent. While not all are ready for prime time, as with objects in the mirror, most are closer than they may appear.
If you substitute the term “machine learning” or “deep learning” for “artificial intelligence,” you get a more accurate description of most of the applications discussed above. Whatever the proper title, this is big-budget, need-a-roomful-of-PhDs kind of stuff.
Accordingly, for most streaming producers, these technologies will be delivered via third-party products and services and not through their own development. However delivered, they will help us do our jobs faster and better and, in some cases, will replace us. So ignore these technologies at your own peril.
About Jan Ozer
Jan Ozer is a leading expert on H.264, H.265, VP9, and AV1 encoding for live and on-demand production. Jan develops training courses for streaming media professionals, provides testing services to encoder developers, and helps video producers perfect their encoding ladders and deploy new codecs. He’s a contributing editor to Streaming Media Magazine and blogs at www.streaminglearningcenter.com.