MSVP: Meta’s first ASIC for video transcoding
May 18, 18:39
Here are the primary stages of the transcoding process:
Decoding:
- Support for a variety of elementary input video stream formats, including H.264, HEVC, VP9, and AV1
- All profiles, 8/10-bit pixel sample depths, and YUV420 color format
Preprocessing:
- Format conversion, including video overlays
- Frame resampling from arbitrary input resolutions to multiple output resolutions (up to 4x) with high precision and wide filters
- Shot detection
Encoding:
- Support for H.264 (AVC) and VP9 coding standards
- 8-bit pixel sample depth, YUV420 color format
Quality metrics (QM):
- Full reference: SSIM, MS-SSIM, VIF, PSNR (Luma and Chroma)
- No-reference blurriness
- QM at multiple viewport resolutions
These stages are implemented as memory-to-memory operations, meaning intermediate buffers are stored back to DRAM and refetched as needed by the downstream operation.
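Of the full-reference metrics listed above, PSNR is the simplest to write down. The following is a minimal software sketch of the standard PSNR definition for a single 8-bit plane; the function name and the tiny test frame are illustrative assumptions and say nothing about how MSVP computes the metric in hardware.

```python
import math

def psnr(reference, distorted, max_value=255.0):
    """PSNR in dB between two same-sized planes given as lists of rows."""
    squared_error = 0.0
    count = 0
    for ref_row, dis_row in zip(reference, distorted):
        for ref_px, dis_px in zip(ref_row, dis_row):
            squared_error += (ref_px - dis_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return float("inf")  # identical planes
    return 10.0 * math.log10(max_value ** 2 / mse)

# Synthetic 4x4 luma plane with one pixel off by 16, so MSE = 256 / 16 = 16.
reference = [[100] * 4 for _ in range(4)]
distorted = [row[:] for row in reference]
distorted[0][0] = 116
score = psnr(reference, distorted)
```

Running the same function separately on the luma and chroma planes gives the per-plane figures that the list refers to.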
Power and performance
Each MSVP ASIC offers a peak transcoding performance of 4K at 15fps at the highest quality configuration with 1-in, 5-out streams, and can scale up to 4K at 60fps at the standard quality configuration. Performance scales linearly with resolution. This performance is achieved at ~10W of PCIe module power. We achieved a throughput gain of ~9x for H.264 compared with libx264 software encoding. For VP9, we achieved a throughput gain of ~50x compared with the libvpx speed 2 preset.
In video coding, compression efficiency is commonly assessed and compared using the Bjontegaard delta rate (BD-Rate), which estimates the average percentage difference in bit rate needed to deliver the same objective quality for a given video relative to a baseline configuration. A negative BD-Rate means that bits are saved.
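To make the BD-Rate definition concrete, here is a minimal sketch of the usual calculation: fit a cubic to log bit rate as a function of quality for each encoder, average the gap between the two curves over the shared quality range, and convert that gap to a percentage. It assumes exactly four (bitrate, quality) points per curve, so the cubic passes through all of them; the function names and sample numbers are made up for illustration.

```python
import math

def _interp(xs, ys, x):
    """Evaluate the cubic through four (x, y) points (Lagrange form)."""
    total = 0.0
    for i in range(len(xs)):
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

def _mean_log_rate(points, lo, hi):
    """Average log bit rate over the quality interval [lo, hi]."""
    quality = [q for _, q in points]
    log_rate = [math.log(r) for r, _ in points]
    mid = 0.5 * (lo + hi)
    # Simpson's rule is exact for cubics, so three samples are enough.
    integral = (hi - lo) / 6.0 * (
        _interp(quality, log_rate, lo)
        + 4.0 * _interp(quality, log_rate, mid)
        + _interp(quality, log_rate, hi)
    )
    return integral / (hi - lo)

def bd_rate(anchor, test):
    """BD-Rate in percent; anchor and test hold four (bitrate, quality) points."""
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    diff = _mean_log_rate(test, lo, hi) - _mean_log_rate(anchor, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0

# Hypothetical data: the test encoder needs half the bits at every quality point.
anchor = [(1000, 34.0), (2000, 37.0), (4000, 40.0), (8000, 42.5)]
test = [(rate / 2, quality) for rate, quality in anchor]
result = bd_rate(anchor, test)
```

With these inputs the result is negative, which matches the convention that a negative BD-Rate indicates bit savings over the baseline.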
MSVP’s video encoding algorithms
The MSVP encoder has two main goals: to be highly power efficient and to deliver the same or better video quality as software encoders. There are existing video encoder IPs, but most of them are targeted at mobile devices with tight area/power constraints and cannot meet the quality bar set by current software encoders.
Because software encoders offer very flexible control and fast evolution over time, it is quite challenging for ASIC video encoders to meet the same performance bar as software encoders.
Here’s a simplified version of the data flow of modern hybrid (prediction plus transform coding) video encoders:
Simplified video encoder modules.
These encoders use intra-coding to remove spatial redundancy and inter-coding to remove temporal redundancy. In inter-coding, several stages of motion estimation are applied to find the best prediction among all candidate block positions in the available reference frames. Entropy coding is the lossless compression stage that removes the statistical redundancy of all syntax elements, including encoding modes, motion vectors, and quantized residual coefficients.
For MSVP’s algorithms to perform the way we wanted, we had to find hardware-friendly alternatives for each of the above key modules. We mainly focused on three levels: block level, frame level, and group of pictures (GOP) level.
At the block level, we looked for coding tools with the highest return on investment: tools that were easy and economical (in terms of silicon area and power requirements) to implement in hardware, and that met our performance targets while maximizing compression efficiency. At the frame level, we studied the best algorithms to make intelligent frame type decisions among I/P/B frames, and the best rate-control algorithms based on statistics collected from hardware. And at the GOP level, we had to decide whether to use multiple-pass encoding with look-ahead, and whether to insert intra (key) frames at a given shot boundary.
Motion estimation is one of the most computationally intensive algorithms in video encoding. To find accurate motion vectors that closely match the block currently being encoded, a full motion estimation pipeline often includes a multistage search that balances search range, computational complexity, and accuracy.
MSVP’s motion search algorithm has to identify which neighboring blocks are likely to contribute the most to quality, and search only around those highly correlated neighbors within a limited cycle budget. Although hardware lacks the flexibility of iterative software motion search algorithms, such as diamond or hexagon search patterns, hardware motion estimation can search multiple blocks in parallel. This allows us to search more candidates, cover a larger search range and more reference frames in both single-direction and bidirectional modes, and search all supported block partition shapes in parallel.
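As an illustration of what a motion search evaluates, the sketch below performs a plain full search over a small window using the sum of absolute differences (SAD) as the matching cost. This is a textbook block-matching example, not MSVP's algorithm: the function names, the window size, and the synthetic frames are assumptions, and the loop visits candidates one by one where hardware would evaluate many of them in parallel.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between a block in cur and one in ref."""
    return sum(
        abs(cur[by + y][bx + x] - ref[ry + y][rx + x])
        for y in range(size)
        for x in range(size)
    )

def full_search(cur, ref, bx, by, size, search_range):
    """Return ((dx, dy), cost) of the best match within +/- search_range."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

# Synthetic 16x16 reference frame (a simple ramp) and a current frame whose
# content is the reference displaced by (dx, dy) = (2, 1).
ref = [[7 * x + 13 * y for x in range(16)] for y in range(16)]
cur = [
    [ref[y + 1][x + 2] if y + 1 < 16 and x + 2 < 16 else 0 for x in range(16)]
    for y in range(16)
]
mv, cost = full_search(cur, ref, bx=4, by=4, size=4, search_range=3)
```

Because each candidate's SAD is independent of the others, the two loops are exactly the kind of work that can be spread across parallel hardware units.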
Rate distortion optimization (RDO)
Achieving high video encoding quality also requires RDO support. Since there are so many decisions to make in video encoding (intra/inter modes, partition block size, transform block types/sizes, etc.), RDO is one of the best practices in video compression to determine which mode is optimal given the current rate or quality target.
MSVP supports exhaustive RDO at almost all mode decision stages. Distortion calculation is compute-intensive, but it is straightforward and easy to parallelize. The real challenge is bit rate estimation. Entropy coding for the final bitstream is sequential in nature, and each context model depends on the previously encoded ones. In a hardware encoder implementation, the rate distortion (RD) cost for different blocks/partitions might be evaluated in parallel, so an exact bit rate is not available at decision time. We therefore implemented a bit rate estimation model in MSVP that is reasonably accurate and hardware friendly, in that it makes it easy to evaluate multiple coding modes in parallel.
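The mode decision that RDO performs is usually expressed as minimizing a Lagrangian cost J = D + λR, where D is the distortion, R is the estimated bit count, and λ sets the trade-off. The sketch below shows that selection for a few candidate modes; the candidate names and the distortion and rate numbers are invented for illustration, and the rates stand in for whatever an estimation model would supply.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    distortion: float  # e.g. sum of squared differences after reconstruction
    rate_bits: float   # estimated bits, as produced by a rate model

def rd_cost(candidate, lam):
    """Lagrangian cost J = D + lambda * R."""
    return candidate.distortion + lam * candidate.rate_bits

def choose_mode(candidates, lam):
    """Pick the candidate with the lowest RD cost."""
    return min(candidates, key=lambda c: rd_cost(c, lam))

# Hypothetical candidates for one block.
candidates = [
    Candidate("intra", distortion=400.0, rate_bits=120.0),
    Candidate("inter", distortion=520.0, rate_bits=40.0),
    Candidate("skip", distortion=900.0, rate_bits=2.0),
]
low_lambda_choice = choose_mode(candidates, lam=1.0)
mid_lambda_choice = choose_mode(candidates, lam=5.0)
high_lambda_choice = choose_mode(candidates, lam=20.0)
```

As λ grows (which corresponds to lower target bit rates), the cheapest-to-signal mode wins even though its distortion is higher. Each candidate's cost depends only on its own D and R, so the costs can be computed independently once a rate estimate is available.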
Quantization is the only lossy part of video compression, and it is also the dominant bit rate control knob in any video coding standard. The corresponding parameter is called the quantization parameter (QP), and it is inversely related to quality: Low QP values result in small quantization errors, creating low distortion levels and, subsequently, high quality at the expense of higher bit rates. By making smart quantization choices, encoding bits can be allocated to areas that impact visual quality the most. We perform smart quantization using optimal QP selection and rounding decisions.
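The relationship between QP, quantization error, and rounding can be sketched with a scalar quantizer. The example assumes the commonly cited H.264 approximation in which the step size doubles every 6 QP and equals 1.0 at QP 4; real encoders use integer scaling tables, and the `rounding` parameter here is a simple illustrative offset rather than MSVP's rounding logic.

```python
def qstep(qp):
    """Approximate H.264 quantizer step size: doubles for every +6 in QP."""
    return 2.0 ** ((qp - 4) / 6.0)

def quantize(coeff, qp, rounding=0.5):
    """Map a transform coefficient to an integer level.

    rounding=0.5 is round-to-nearest; smaller offsets widen the dead zone
    and push more coefficients toward zero, which saves bits.
    """
    level = int(abs(coeff) / qstep(qp) + rounding)
    return level if coeff >= 0 else -level

def dequantize(level, qp):
    """Reconstruct the coefficient value from its level."""
    return level * qstep(qp)

coeff = 100.0
error_low_qp = abs(coeff - dequantize(quantize(coeff, 28), 28))   # step 16
error_high_qp = abs(coeff - dequantize(quantize(coeff, 40), 40))  # step 64
level_nearest = quantize(coeff, 40, rounding=0.5)
level_deadzone = quantize(coeff, 40, rounding=1.0 / 6.0)
```

The lower QP reconstructs the coefficient with a smaller error, and changing only the rounding offset changes the level that is coded, which is why rounding decisions are worth optimizing alongside QP.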
Modern video coding standards allow different QP values to be applied to different coding units. In MSVP’s hardware encoder, block-level QP values are determined adaptively based on both spatial and temporal characteristics.
In spatial adaptive QP (AQP) selection, a larger QP value can be applied to coding blocks in high-texture or high-motion areas, because the human visual system is less sensitive to quality loss there. In temporal AQP, coding blocks that are referenced more often by future frames can be quantized with a lower QP to get higher quality, so that the future coding blocks that reference them also benefit.
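One common way to realize spatial AQP in software is to derive a per-block QP offset from pixel variance, so that textured blocks get a higher QP and flat blocks a lower one. The sketch below follows that general idea; the log-variance measure, the `strength` and `max_delta` parameters, and the sample blocks are all assumptions made for illustration and are not taken from MSVP.

```python
import math

def block_variance(block):
    """Pixel variance of a block given as a list of rows."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def spatial_qp_offsets(blocks, strength=1.0, max_delta=6):
    """Per-block QP offsets: positive for busy blocks, negative for flat ones."""
    energies = [math.log2(block_variance(b) + 1.0) for b in blocks]
    mean_energy = sum(energies) / len(energies)
    offsets = []
    for energy in energies:
        delta = round(strength * (energy - mean_energy))
        offsets.append(max(-max_delta, min(max_delta, delta)))
    return offsets

# Hypothetical 4x4 blocks: one flat, one checkerboard texture.
flat = [[128] * 4 for _ in range(4)]
textured = [[255 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]
offsets = spatial_qp_offsets([flat, textured])
```

Centering the offsets on the mean energy keeps the average QP of the frame roughly unchanged, so bits are moved between blocks rather than added overall.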
Smart rounding jointly optimizes the rounding decisions for all coefficients in each coding block. Since the rounding choices at different coefficient positions depend on one another, we need algorithms that remove this dependency while maintaining the accuracy of the rounding decisions. To reduce compute cost, we apply smart rounding only at the final stage, after the coding mode for each block has been determined. This feature alone can achieve a ~1 to 2 percent BD-Rate improvement.
The frame-level algorithm for the MSVP H.264 encoder can be configured to be either two-pass or one-pass, depending on whether it is a VOD or live streaming use case. In the high quality (longer latency) VOD two-pass mode, MSVP looks ahead N frames and collects statistics, such as intra/inter cost and motion vectors, from these frames. Based on the statistics collected in the look-ahead, frame level control then applies back-propagation on the reference tree in the look-ahead buffers to assign an importance to each reference frame. Next, the accumulated reference importance of the frame to be coded is modulated using the temporal AQP of each block. Finally, the delta QP map is passed to the final encoding pass, where it is used as the encoding QP and is also captured in the output bitstream.
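The back-propagation step can be pictured with a frame-level toy model: walk the look-ahead frames in reverse coding order and credit each reference frame with the share of information that later frames inherit from it. This is a simplified sketch in the spirit of look-ahead schemes used by software encoders, not MSVP's implementation; the cost fields, the inheritance formula, and the `strength` factor are assumptions.

```python
import math

def propagate_importance(frames):
    """frames: dicts in coding order with 'intra_cost', 'inter_cost', and
    'refs' (indices of earlier frames). Returns one importance per frame."""
    importance = [0.0] * len(frames)
    for idx in range(len(frames) - 1, -1, -1):
        frame = frames[idx]
        if not frame["refs"]:
            continue
        # Share of this frame's information that comes from its references:
        # the cheaper inter coding is relative to intra, the larger the share.
        inherited = max(0.0, 1.0 - frame["inter_cost"] / frame["intra_cost"])
        amount = (frame["intra_cost"] + importance[idx]) * inherited
        for ref in frame["refs"]:
            importance[ref] += amount / len(frame["refs"])
    return importance

def delta_qp(importance, intra_cost, strength=2.0):
    """More heavily referenced content gets a more negative QP offset."""
    return -strength * math.log2(1.0 + importance / intra_cost)

# Hypothetical look-ahead window: one key frame followed by two P frames.
frames = [
    {"intra_cost": 1000.0, "inter_cost": 1000.0, "refs": []},
    {"intra_cost": 1000.0, "inter_cost": 200.0, "refs": [0]},
    {"intra_cost": 1000.0, "inter_cost": 500.0, "refs": [1]},
]
importance = propagate_importance(frames)
offsets = [delta_qp(imp, f["intra_cost"]) for imp, f in zip(importance, frames)]
```

In this toy window the key frame accumulates the most importance because both later frames depend on it, directly or indirectly, so it receives the largest QP reduction. A per-block version would apply the same accumulation along motion vectors instead of whole frames.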
MSVP H.264 encoder frame level control flow.
In MSVP’s VP9 encoder, multiple-pass encoding is also enabled for high-quality VOD use cases. An analysis pass (the first pass) is performed up front to capture the video characteristics into a set of statistics, and the statistics are used to determine the frame level parameters for filtering and encoding. Since VP9’s frame type is different from H.264’s, the strategy for making frame level decisions is also different, as shown in the following figure:
VP9 encoder frame level algorithm flow.