Sound for UHD and 4K
Sound is fundamental to the user experience of audiovisual content. We interviewed Sergio Márquez of Nómada Media, who generously shared his knowledge and helped us capture in this section the keys to recording, post-producing and broadcasting sound for cinema and ultra high definition video.
Let’s start by describing the parameters of digital sound quality: sample rate and bit depth. Next, we address a fundamental issue: loudness, which brings us closer to concepts such as the dynamic range of sound and dynamic compression.
Having set out these theoretical basics, we review the various surround sound codecs in use or in development for 4K cinema and ultra high definition video. We have organized this part into five sections:
- The beginnings of digital audio for cinema
- Immersive sound, 3D sound or object-based sound
- Sound for movie theaters
- Sound codecs for UHD TV and Blu-ray
- Audio codecs for internet broadcasting
Sample rate and bit depth
The two technical parameters that define the quality of a digital sound signal are sampling frequency and bit depth.
The sample rate is the number of samples taken per second to generate the digital signal. It is expressed in samples per second, usually stated in kilohertz (kHz).
Representation of two different sampling frequency values. The horizontal axis represents time and the vertical axis represents the value of the audio signal. Source: Progulator.
The sampling frequency used to record and process sound is determined by human sensory capacity: we are considered able to hear sounds between 20 and 20,000 Hz. According to the Nyquist theorem, the sampling frequency must be greater than twice the highest frequency to be sampled. Therefore, the sampling frequency must be at least 40 kHz.
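As a quick illustration of the theorem (a minimal Python sketch of ours, not from the original report), the minimum rate can be computed directly:

```python
def nyquist_min_rate(f_max_hz: float) -> float:
    """Minimum sampling rate (Hz) needed to capture content up to f_max_hz."""
    return 2 * f_max_hz

# Upper limit of human hearing (20 kHz) -> at least 40 kHz
print(nyquist_min_rate(20_000))  # 40000.0
```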
The sampling frequency values commonly used for professional sound are as follows:
| Medium | Sampling frequency |
| --- | --- |
| Audio CD | 44.1 kHz |
| Broadcast / DVD | 48 kHz |
| DVD Audio / Blu-ray | 96 kHz |
| Blu-ray | 192 kHz |
In music production for CD, 44.1 kHz has traditionally been used because of the limited storage capacity of the media. But in audiovisual production this sampling frequency is considered insufficient.
The standard in audiovisual production is 48 kHz. Cameras and recorders, from the consumer range to the most professional level, use this sampling rate. It is also the standard for broadcast television and for DVD and Blu-ray discs. Likewise, 48 kHz is the standard in location sound recording.
Sample rate selection menu of a portable audio recorder. Source: Zoom
In recording and post-production studios for high quality audiovisual and film work, the tendency is to work at 96 kHz. Although the location sound material is recorded at 48 kHz, the technician does the post-production and mixing (reverbs, delays…) at 96 kHz. This oversampling brings a number of advantages (greater ability to alter an original sound, etc.) and preserves the highest possible quality until the final stage.
Recording at 96 kHz is also becoming more frequent, since harmonics (integer multiples of a fundamental frequency) are very important for perceiving a sound naturally. Above all, the first harmonic (×2) is fundamental. For example, when a violin emits a 15 kHz base note, its first harmonic is 30 kHz, and capturing it therefore requires a 60 kHz sampling frequency. So it makes sense to work in a full 96 kHz system.
Higher sampling frequencies, such as 192 kHz, are used at ‘audiophile’ levels, in elite collections or labels, such as some classical music recordings by Deutsche Grammophon or, in jazz, the Chesky Records productions. These productions aim for the highest possible sound quality, but on a perceptual level most people do not hear the difference. In theory at least, a sampling frequency of 48 kHz is sufficient to cover the range of human hearing.
Bit depth (quantization) is the number of bits used to record each sample. The more bits used, the more information is recorded and the more space the resulting file occupies on the recording medium. It therefore determines the precision with which each audio sample is quantized and how much information is captured.
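To make this concrete, a small illustrative snippet (ours, assuming linear PCM) shows how many discrete amplitude levels each bit depth provides:

```python
# Discrete amplitude levels available in linear PCM at each bit depth
for bits in (16, 24, 32):
    print(f"{bits}-bit PCM: {2 ** bits:,} quantization levels")
# 16-bit: 65,536 · 24-bit: 16,777,216 · 32-bit: 4,294,967,296
```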
The values commonly used in professional audio are as follows:
| Use | Bit depth (quantization) |
| --- | --- |
| Audio CD | 16 bits |
| Broadcast / DVD / Blu-ray | 16 bits |
| High quality sound | 24 bits |
| Highest quality | 32 bits |
| Highest quality post-production | 32-bit float |
In location sound recording almost everything is done at 24 bits, although a strong 16-bit legacy remains. Equipment has been ready for 24 bits for years and can record at 32 bits without any problems.
Bit depth is closely related to the dynamic range of audio, that is, to the ability to record both the quietest and the loudest sounds with quality. Human perception of sound spans an estimated 120 decibels of dynamic range, and 24 bits is more than enough to record these values.
In post-production everything can be done at 32 bits or at 32-bit floating point (float), and some systems work at even higher precision. But the most common is to work at 24 bits.
Floating-point representation means that the equipment uses as much numerical precision as needed in its calculations to preserve the original quality. This option can be selected on high-quality post-production equipment.
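A minimal NumPy sketch (our illustration, with arbitrary gain values) of why floating point protects quality: intermediate values above full scale survive in float, while a 16-bit integer path clips them irreversibly:

```python
import numpy as np

x = np.linspace(-0.9, 0.9, 5, dtype=np.float32)  # samples near full scale

# Float path: boost by +12 dB, then cut back -> signal survives intact
boosted = x * 4.0           # values exceed ±1.0 but are kept exactly in float
restored = boosted / 4.0
print(np.allclose(x, restored))  # True

# 16-bit integer path: the same boost clips at the format's ceiling
xi = (x * 32767).astype(np.int16)
clipped = np.clip(xi.astype(np.int32) * 4, -32768, 32767).astype(np.int16)
print(np.array_equal(xi, (clipped / 4).astype(np.int16)))  # False: peaks lost
```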
These two variables, sampling rate and bit depth, determine the resulting transfer rate (bit rate), measured in bits per second (b/s, kb/s or Mb/s).

The bit rates we typically work with in the different phases of sound production and distribution are as follows (a quick arithmetic check follows the table):
| Format | Bit rate |
| --- | --- |
| MP3 compressed mono sound (poor quality) | 64 kb/s |
| Stereo audio with AAC compression (sufficient quality) | 256 kb/s |
| Dolby Digital 5.1 on DVD | 448 kb/s |
| Uncompressed PCM stereo audio (16-bit, 48 kHz) | 1.5 Mb/s |
| Uncompressed PCM 5.1 audio (24-bit, 48 kHz) | 7 Mb/s |
| DTS-HD Master Audio 7.1 (24-bit, 96 kHz) | 24 Mb/s |
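The uncompressed figures follow from multiplying sampling frequency × bit depth × number of channels. A minimal Python check of the PCM rows above:

```python
def pcm_bitrate_mbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Uncompressed PCM bit rate in Mb/s."""
    return sample_rate_hz * bit_depth * channels / 1_000_000

print(pcm_bitrate_mbps(48_000, 16, 2))  # 1.536 -> ~1.5 Mb/s stereo PCM
print(pcm_bitrate_mbps(48_000, 24, 6))  # 6.912 -> ~7 Mb/s 5.1 PCM
```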
Dynamic compression / data compression
In digital sound the word “compression” is used in two different contexts with very different meanings:
- Dynamic compression. In sound post-production, compression is a manipulation of the dynamic levels of the signal that aims to reinforce weaker sounds and attenuate louder ones. Dynamic compressors are typically used in broadcast and popular music to increase loudness. High quality sound productions eschew the use of compressors in favor of dynamic range.
Dynamic compression of an audio signal. Source: Articulate
- Data compression. When sound is digitally encoded, compression is the reduction of file size using algorithms called ‘audio codecs’. The best known example of data compression for sound is the ‘.mp3’ format, which drastically reduces the space an audio file occupies on a hard disk, at the cost of a loss of information and therefore of quality. For cinema sound distribution, the most widespread compression standard is ‘Dolby Digital’.
MP3 data compression. Source: Sarte Audio Elite
Data compression in the highest quality formats. Source: The Media Server
Dynamic range and loudness
The dynamic range of sound is the difference between the lowest and highest level values. It is measured in decibels (dB).
There is a direct relationship between bit depth and the dynamic range of sound: each bit used to generate the digital signal adds approximately 6 dB of recordable sound pressure range. With 16 bits, 96 dB of dynamic range is achieved, rising to 144 dB with 24 bits. This dynamic range is, in theory, sufficient, since human sound perception spans around 120 dB; above 120 dB lies the threshold of pain.
Dynamic range recording capability as a function of bit depth. Source: Libremusicproduction.
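The 6 dB-per-bit rule follows from the decibel definition applied to the 2^n quantization levels; a minimal sketch:

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear PCM: 20*log10(2**bits) ≈ 6.02 dB/bit."""
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))  # ~96.3 dB
print(round(dynamic_range_db(24), 1))  # ~144.5 dB
```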
In classical music recordings it makes a lot of sense to use 24 bits, since the difference in dynamics between a pianissimo passage and the whole orchestra playing at once is very large.
In film, dynamic range is also very important. Studios recording orchestras for soundtracks often prefer to use 32 bits to capture more dynamics. The excellent reproduction conditions of a movie theater allow both the faintest sounds and the roar of action scenes to be clearly appreciated.
However, in radio and television, compressors are used to reduce the dynamic range, so there is much less difference between the lowest and highest sound levels. This dynamic compression is done to increase loudness and aid intelligibility in critical listening conditions, for example when listening to the radio in a car or to the television in a noisy environment. The result is that both whispering and shouting stay at a high, intelligible level. The casualty of this process is the dynamic range, and the outcome is a departure from the natural perception of sound.
Waveform of the uncompressed signal (bottom) and with dynamic compression (top). Source: Wachusett.
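To make the idea concrete, here is a minimal, illustrative sketch of the static gain curve at the heart of a downward compressor (the threshold and ratio values are arbitrary; real compressors also smooth the gain over time with attack and release controls):

```python
import numpy as np

def compressor_gain_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static curve of a downward compressor: above the threshold, each
    extra dB of input yields only 1/ratio dB of output (returns gain in dB)."""
    over = np.maximum(level_db - threshold_db, 0.0)
    return -over * (1.0 - 1.0 / ratio)

levels = np.array([-40.0, -20.0, -8.0, 0.0])   # input levels in dBFS
print(levels + compressor_gain_db(levels))      # [-40. -20. -17. -15.]
```

Note how the 40 dB spread between the quietest and loudest inputs shrinks to 25 dB: intelligibility up, dynamic range down.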
This issue is apparent in the perceived loudness of television advertising. Broadcasters have found that peak level measurements do not reflect the perceived loudness of a program. Producers of commercials make sound mixes with heavy dynamic compression, sacrificing dynamic range in favor of louder sound. This generates viewer complaints and lowers sound quality standards:
“Everything sounds super-flat, and there is a decibel war to see who sounds louder” (Sergio Márquez).
The European standard EBU R 128 (2011), “Loudness normalisation and permitted maximum level of audio signals”, attempts to bring order to this issue:
“The switch from audio peak-normalization to loudness normalization is probably the biggest revolution in professional audio of the last decades. It is important for broadcasters to be aware of the loudness paradigm and how to adapt their systems and working practices accordingly.” (Roger Miles, EBU)
WLM loudness meter. Source: Waves
“The important thing is not who sounds louder, but that the movie they are playing on TV sounds similar to how it sounded in the movie theater, that is, with its high peaks and whispering passages. We have the technology to deal with 120 dB of dynamic range, so let’s forget about sounding all the time at 100 dB on TV! Let’s try to respect an appropriate dynamic for each environment.” (Sergio Márquez)
“If a composer wants to transmit noise, aggressiveness, discomfort, anger, etc., it seems a good idea to do wild things with the sound; but if the reason is ‘I don’t want to sound lower than the others’, I think it’s better to get used to the idea that the rules of the game have changed.” (Ibon Larruzea)
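In practice, normalizing a program to the EBU R 128 target of −23 LUFS can be sketched with the open-source pyloudnorm library. This is one possible toolchain chosen for illustration, not the workflow described by the interviewees, and the file names are placeholders:

```python
import soundfile as sf       # reads/writes WAV files
import pyloudnorm as pyln    # ITU-R BS.1770 loudness meter implementation

data, rate = sf.read("program.wav")          # "program.wav" is a placeholder

meter = pyln.Meter(rate)                     # BS.1770 meter for this sample rate
loudness = meter.integrated_loudness(data)   # integrated loudness in LUFS

# Normalize the program to the EBU R 128 target of -23 LUFS
normalized = pyln.normalize.loudness(data, loudness, -23.0)
sf.write("program_r128.wav", normalized, rate)
```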
Codecs for multichannel sound
Fantasia (1940, Walt Disney) was the first film to experiment with surround sound for cinema. Since then, many sound systems have been developed to enhance the user experience both in theaters and at home (home cinema).
Speaker placement in a multichannel surround sound system for a cinema exhibition hall. Source: Wikipedia
Nowadays, several surround sound configurations can be found in cinema:
- L.C.R. (left, center, right): adds a center channel to the classic stereo format of the music industry. A variant, now in disuse, is LCRS, which uses a single surround channel.
- Surround 5.1: adds two side channels (surround left, surround right) and a subwoofer channel.
- Surround 7.1: adds two rear channels (rear left, rear right) to the 5.1 layout.
- Immersive sound, 3D sound or object-based sound: a new concept of surround sound based not on the number of channels but on the spatial positioning of each sound source.
The 7.1 system dedicates two separate rear channels (rear left, rear right), differentiated from the side surround channels. Source: Tech Review
The beginnings of digital audio for cinema
In the years before digital cinema projection (DCI) there were three digital audio systems for theaters: Dolby, DTS and SDDS. These three formats solved the technical limitations of recording, storing and processing digital sound, as well as the synchronization of digital sound with 35 mm film projection.
The first to arrive in this race was ‘Dolby Digital (AC-3)’, released in 1992 with Batman Returns (Tim Burton). It was the first 5.1 digital multichannel audio system for cinema. It was a very lossy codec, compressing around 10 to 1 (10:1), but thanks to its excellent psychoacoustic coding it discriminated very well the sounds most relevant to human hearing. Technically it was very efficient, managing to fit six channels into 448 kb/s; without compression, the equivalent would be almost 7 Mb/s.
Synchronization with the projection was done by printing the soundtracks on the celluloid, a technique in use since the early days of sound film. The following image shows the analog sound print of ‘Dolby Stereo’, first used with Star Wars (1977, George Lucas), and the Dolby Digital solution of the 1990s, which used the film space between the perforations to print the digital audio signal.
Printing of soundtracks on celluloid: Dolby Stereo and Dolby Digital. Source: Brian Florian
Just after the release of ‘Dolby Digital’, with Jurassic Park (1993, Steven Spielberg), ‘DTS (Digital Theater System)’ appeared, offering higher quality and much lighter compression (3:1). This increase in quality was especially appreciated by directors and the technical-artistic team responsible for the soundtrack, but the system was more complex and costly. DTS used a proprietary system to synchronize the digital audio, carried on CD discs, with the projection of the 35 mm film by means of a time code printed on the celluloid.
The third system, ‘SDDS (Sony Dynamic Digital Sound)’, was launched with Last Action Hero (1993, John McTiernan) and was installed mainly in theaters that had distribution agreements with Sony. This patent defined an early 7.1 configuration, different from the one currently in use, placing five channels (left, left-center, center, right-center and right) behind the screen. Only the big Hollywood productions mixed in this format (mainly Columbia Pictures, a Sony subsidiary), which, because it required more speakers and amplifiers, was also much more expensive for the exhibitor to install.
Of these three solutions for digital audio in movie theaters, Dolby, which arrived first, managed to become almost the de facto standard.
Years later, when 35 mm film projectors were replaced by DCI digital systems, sound synchronization was no longer an issue. The film carrier became a hard disk containing the DCP, with sufficient capacity to store and play back uncompressed sound, so PCM (.wav) files could be used directly for stereo, 5.1 or 7.1 surround sound.
Since then, multichannel sound for digital cinema projection can be produced without paying any license fees or patent royalties.
The following screenshot shows the ‘OpenDCP’ software menu for the creation of a 7.1 audio file for movie theaters: eight mono channels in uncompressed PCM (.wav) format are used to generate an MXF file.
Configuration of the sound channels for 7.1 for DCP with OpenDCP. The system creates an MXF file from the eight mono channels in PCM (.wav) format.
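Before wrapping, the eight mono files must share the same format. A minimal sanity-check sketch using Python's standard wave module (the file names and channel order are illustrative; the exact order must match the channel mapping of the DCP tool in use):

```python
import wave

# Illustrative 7.1 channel order; verify against your DCP tool's mapping
channels = ["L.wav", "R.wav", "C.wav", "LFE.wav",
            "Ls.wav", "Rs.wav", "BsL.wav", "BsR.wav"]

formats = set()
for name in channels:
    with wave.open(name, "rb") as w:
        # DCP audio channels must be mono PCM, typically 24-bit / 48 or 96 kHz
        assert w.getnchannels() == 1, f"{name} is not mono"
        formats.add((w.getframerate(), w.getsampwidth()))

assert len(formats) == 1, f"Mixed formats across channels: {formats}"
print("All 8 channels match (rate Hz, bytes/sample):", formats.pop())
```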
Report prepared by Luis Ochoa, Sergio Márquez and Francisco Utray.