Media compression aims to find a representation that requires the least possible amount of information (bitrate) and from which the original data can be reconstructed. A distinction is made between lossless and lossy compression; in lossy compression, the reconstruction is not necessarily perfect, which allows even lower bitrates to be achieved. This leads to a trade-off between low bitrate and high reconstruction quality, known as the rate-distortion trade-off. The processing steps of a compression system comprise a lossless or lossy data transformation, followed by a quantization step and potentially an entropy coding step, all of which are implemented in classical systems with hand-engineered functions. The resulting bitstream is decompressed at the receiver side by the respective inverse operations. Using deep neural networks, some or all of these processing steps can be learned by jointly minimizing the reconstruction distortion and the bitrate as a cost function, thereby outperforming classical systems. As a basic structure, an encoder and a decoder (autoencoder) are often learned as inverse counterparts of each other as elementary blocks.
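To make the joint objective concrete, the following is a minimal PyTorch-style sketch of such a rate-distortion loss. The names are illustrative: x_hat stands for the decoder output and likelihoods for the probabilities an entropy model assigns to the quantized latents; the specific form is an assumption, not a particular published system.

```python
import torch
import torch.nn as nn

class RateDistortionLoss(nn.Module):
    """Joint rate-distortion objective: distortion + lambda * rate."""

    def __init__(self, lmbda: float = 0.01):
        super().__init__()
        self.lmbda = lmbda  # weights the rate-distortion trade-off
        self.mse = nn.MSELoss()

    def forward(self, x, x_hat, likelihoods):
        # Distortion: reconstruction error between input and decoder output.
        distortion = self.mse(x_hat, x)
        # Rate: estimated bits per pixel, i.e., -log2 of the probability
        # the entropy model assigns to each quantized latent symbol.
        num_pixels = x.numel() / x.size(1)  # N * H * W for (N, C, H, W) inputs
        rate = -torch.log2(likelihoods).sum() / num_pixels
        return distortion + self.lmbda * rate
```

Choosing a larger lmbda shifts the optimum toward higher quality at higher bitrate; sweeping it traces out the rate-distortion curve of the learned codec.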
In the field of speech compression, for example, a WaveNet-based decoder has been proposed that synthesizes wideband speech samples from a bitstream generated by a conventional codec; this approach is especially well suited for lower bitrates. Furthermore, a trainable quantization scheme, used to learn a discrete latent representation, has been shown to produce results comparable to current coding standards, again particularly effective at low bitrates. An end-to-end-learned autoencoder trained on raw speech samples can even compete with AMR-WB at various bitrates, with the most significant improvements observed at higher bitrates. Last but not least, deep neural networks can serve as speech enhancers for decoded speech after lossy compression.
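One common way to make the quantization step trainable, sketched here under the assumption of simple scalar rounding (not necessarily the scheme used in the works above), is to replace the non-differentiable rounding with a straight-through estimator or an additive-noise proxy during training:

```python
import torch

def quantize_ste(latent: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: hard rounding in the forward pass,
    # identity gradient in the backward pass (the rounding error is
    # detached from the computation graph).
    return latent + (torch.round(latent) - latent).detach()

def quantize_noise(latent: torch.Tensor) -> torch.Tensor:
    # Training-time proxy: additive uniform noise in [-0.5, 0.5]
    # mimics the rounding error while keeping the graph differentiable.
    return latent + torch.empty_like(latent).uniform_(-0.5, 0.5)
```

At inference time, hard rounding is used in both cases; the two variants differ only in how the gradient is approximated during training.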
While a fixed bitrate is often required for speech coding, an entropy model is usually used in image compression, so that the exact bitrate depends on the respective input (variable bitrate); bitrate specifications therefore usually refer to average bitrates. The rate-distortion trade-off can then be adjusted by weighting distortion against bitrate, both of which enter the loss function: the quality of the lossy transformation and the quantization determines the distortion, while the entropy model determines the bitrate. Many improvements have been achieved in the architectures used for image compression. For example, skip connections are adopted from ordinary autoencoders to increase reconstruction quality at the expense of a second bottleneck; if this additional side information is included in the entropy modeling, the result is called a hyperprior architecture. For evaluation, various metrics are used that model human perception to varying degrees. Generative adversarial networks (GANs) have been used to create reconstructions that look particularly realistic. However, compression is not always intended to produce reconstructions that look good to the human eye: if the images are input data for a subsequent processing step, the performance of that stage can also be an important metric. In this case, learned image compression has been shown to offer an advantage over classical codecs.
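As an illustration of how such side information can drive the entropy model, the following is a hyperprior-style sketch; the module layout, layer sizes, and names are assumptions for illustration, not a specific published architecture. The side information z is decoded into scale parameters of a Gaussian model for the main latent y, so the estimated bitrate adapts to the content of each image.

```python
import torch
import torch.nn as nn

class Hyperprior(nn.Module):
    """Second bottleneck (side information) that parameterizes the
    entropy model of the main latent y. Assumes spatial dimensions
    of y are divisible by 4."""

    def __init__(self, channels: int = 192):
        super().__init__()
        self.hyper_enc = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
        )
        self.hyper_dec = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.Softplus(),
        )

    def forward(self, y: torch.Tensor):
        z = self.hyper_enc(y)                 # side information (2nd bottleneck)
        z_hat = torch.round(z)                # quantized like the main latent
        scales = self.hyper_dec(z_hat) + 1e-6 # predicted std devs for y
        # Likelihood of each quantized symbol under N(0, scale), integrated
        # over its quantization bin [y_hat - 0.5, y_hat + 0.5].
        gauss = torch.distributions.Normal(0.0, scales)
        y_hat = torch.round(y)
        likelihoods = gauss.cdf(y_hat + 0.5) - gauss.cdf(y_hat - 0.5)
        return y_hat, z_hat, likelihoods
```

The resulting likelihoods can be fed directly into a rate-distortion loss such as the one sketched earlier; the bits spent on z are the price paid for a more accurate entropy model of y.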
Figure: Samples from (a) JPEG, (b) JPEG2000, (c) WebP, and (d) GAN-based compression with quantization included in the training process. The effects are best viewed in color and on a computer screen.
The difference between image and video compression is the inclusion of temporal context, which can significantly reduce the bitrate compared to single-frame image compression. To model the temporal context, optical flow is often used, which assigns a motion vector to each pixel of a frame. If the optical flow is available, a prediction of the next frame can be generated by motion estimation before the actual compression; only the deviation between this prediction and the actual next frame is then compressed. On the receiver side, the inverse process is performed in the form of motion compensation. The architectures used to compress the deviation are often very similar to those used in image compression. It is also possible to use recurrent networks such as convolutional LSTMs instead of plain autoencoders, so that the networks learn the temporal context implicitly and carry it as an internal state across multiple frames.
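A minimal sketch of the motion compensation step, assuming PyTorch and a dense flow field in pixel units; the function name and calling conventions are illustrative:

```python
import torch
import torch.nn.functional as F

def motion_compensate(prev_frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp the previous frame along the optical flow to predict the next one.

    prev_frame: (N, C, H, W) previously reconstructed frame
    flow:       (N, 2, H, W) motion vectors (dx, dy) in pixels
    """
    n, _, h, w = prev_frame.shape
    # Base pixel coordinates of the target frame.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=prev_frame.dtype, device=prev_frame.device),
        torch.arange(w, dtype=prev_frame.dtype, device=prev_frame.device),
        indexing="ij",
    )
    # Shift the sampling positions by the flow and normalize to [-1, 1],
    # the coordinate convention expected by grid_sample.
    grid_x = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0
    grid_y = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(prev_frame, grid, align_corners=True)

# Only the residual is then compressed by an image-compression-like network:
#   residual = next_frame - motion_compensate(prev_frame, flow)
# and the receiver reconstructs:
#   next_frame_hat = motion_compensate(prev_frame_hat, flow_hat) + residual_hat
```

Note that the flow field itself must also be transmitted (and is typically compressed as well), so a learned video codec trades bits between motion information and residual information.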