Audio compression is a form of data compression designed to reduce the transmission bandwidth
requirement of digital audio streams and the storage size of audio files. Audio compression
algorithms are implemented in computer software as audio codecs. Generic data compression
algorithms perform poorly with audio data, seldom reducing it much below 87% of its original
size, and are not designed for use in real-time applications. Consequently, lossless and lossy
algorithms optimized specifically for audio have been created. Lossy algorithms provide greater
compression ratios and are used in mainstream consumer audio devices.
In both lossy and lossless compression, information redundancy is reduced, using methods such as
coding, pattern recognition and linear prediction to reduce the amount of information used to
represent the uncompressed data.
For most practical audio applications, the savings in transmission or storage size outweigh the
slight reduction in audio quality, since users often cannot perceive any loss in playback
quality. For example, one compact disc (CD) holds approximately one hour of uncompressed high
fidelity music, less than 2 hours of music compressed losslessly, or 7 hours of music compressed
in the MP3 format at medium bit rates.
Lossy audio compression
Lossy audio compression is used in an extremely wide range of applications. In addition
to the direct applications (MP3 players or computers), digitally
compressed audio streams
are used in most video DVDs; digital television; streaming media on the internet; satellite
and cable radio; and increasingly in terrestrial radio broadcasts. Lossy compression
typically achieves far greater compression than lossless compression (data of 5 percent to
20 percent of the original stream, rather than 50 percent to 60 percent), by discarding
less-critical data.
The innovation of lossy audio compression was to use psychoacoustics to recognize that not
all data in an audio stream can be perceived by the human auditory system. Most lossy
compression reduces perceptual redundancy by first identifying sounds which are considered
perceptually irrelevant, that is, sounds that are very hard to hear. Typical examples include
high frequencies, or sounds that occur at the same time as louder sounds. Those sounds are
coded with decreased accuracy or not coded at all.
While removing or reducing these 'unhearable' sounds may account for a small percentage
of bits saved in lossy compression, the real savings comes from a complementary phenomenon:
noise shaping. Reducing the number of bits used to code a signal increases the amount of
noise in that signal. In psychoacoustics-based lossy compression, the real key is to 'hide'
the noise generated by the bit savings in areas of the audio stream that cannot be perceived.
This is done by, for instance, using very small numbers of bits to code the high frequencies
of most signals - not because the signal has little high-frequency information (though this is
often true as well), but rather because the human ear can only perceive very loud signals
in this region, so that softer sounds 'hidden' there simply aren't heard.
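The effect of coarser quantization is easy to demonstrate. The following Python sketch
(illustrative only, not taken from any particular codec) quantizes a 1 kHz test tone at several
bit depths and measures the signal-to-noise ratio; each bit removed costs roughly 6 dB of SNR,
and the perceptual coder's job is to place that noise where masking renders it inaudible.

    import numpy as np

    # Quantize a full-scale 1 kHz sine at several bit depths and measure
    # the signal-to-noise ratio; a uniform quantizer loses roughly 6 dB
    # of SNR per bit removed.
    fs = 48000
    t = np.arange(fs) / fs
    signal = np.sin(2 * np.pi * 1000 * t)

    for bits in (16, 8, 4):
        levels = 2 ** (bits - 1)
        quantized = np.round(signal * levels) / levels
        noise = signal - quantized
        snr_db = 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))
        print(f"{bits:2d} bits -> SNR ~ {snr_db:5.1f} dB")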
If reducing perceptual redundancy does not achieve sufficient compression for a particular
application, further lossy compression may be required. Depending on the audio source, this
still may not produce perceptible differences; speech, for example, can be compressed far more
than music. Most lossy compression schemes allow compression parameters to be adjusted to
achieve a target rate of data, usually expressed as a bit rate. Again, the data reduction
will be guided by some model of how important the sound is as perceived by the human ear,
with the goal of efficiency and optimized quality for the target data rate. (There are many
different models used for this perceptual analysis, some better suited to different types of
audio than others.) Hence, depending on the bandwidth and storage requirements, the use of
lossy compression may result in a perceived reduction of the audio quality that ranges from
none to severe, but generally an obviously audible reduction in quality is unacceptable to
listeners.
Because data is removed during lossy compression and cannot be recovered by decompression,
lossy compression is often avoided for archival storage. Hence, as noted, even those
who use lossy compression (for portable audio applications, for example) may wish to keep a
losslessly compressed archive for other applications. In addition, compression technology
continues to advance, and taking advantage of a new state-of-the-art lossy codec requires
starting again from the lossless, original audio data.
The nature of lossy compression (for both audio and images) results in increasing degradation
of quality if data are decompressed, then recompressed using lossy compression.
Coding methods
Transform domain methods
In order to determine what information in an audio signal is perceptually irrelevant, most
lossy compression algorithms use transforms such as the modified discrete cosine transform
(MDCT) to convert time domain sampled waveforms into a transform domain. Once transformed,
typically into the frequency domain, component frequencies can be allocated bits according
to how audible they are. Audibility of spectral components is determined by first calculating
a masking threshold, below which it is estimated that sounds will be beyond the limits of human
perception.
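As a concrete sketch, the following Python implements the forward MDCT naively (an O(N^2)
matrix product with a sine window; real codecs use fast, overlapped implementations) and shows
how a pure tone's energy concentrates in a few transform bins, which is what makes
per-frequency bit allocation possible.

    import numpy as np

    def mdct(block):
        # Naive MDCT of a 2N-sample block, returning N coefficients:
        # X[k] = sum_n x[n] * cos(pi/N * (n + 0.5 + N/2) * (k + 0.5)).
        two_n = len(block)
        n = two_n // 2
        k = np.arange(n)
        idx = np.arange(two_n)
        basis = np.cos(np.pi / n * (idx[None, :] + 0.5 + n / 2)
                       * (k[:, None] + 0.5))
        return basis @ block

    # A 1 kHz tone at 48 kHz lands almost entirely in a couple of bins.
    fs, two_n = 48000, 1024
    x = np.sin(2 * np.pi * 1000 * np.arange(two_n) / fs)
    window = np.sin(np.pi * (np.arange(two_n) + 0.5) / two_n)  # sine window
    coeffs = mdct(x * window)
    print("strongest bin:", int(np.argmax(np.abs(coeffs))))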
The masking threshold is calculated using the absolute threshold of hearing and the principles
of simultaneous masking - the phenomenon wherein a signal is masked by another signal separated
from it in frequency - and, in some cases, temporal masking - where a signal is masked by another
signal separated from it in time. Equal-loudness contours may also be used to weight the perceptual importance
of different components. Models of the human ear-brain combination incorporating such effects are
often called psychoacoustic models.
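One widely published ingredient of such models is Terhardt's approximation of the absolute
threshold of hearing. The sketch below evaluates that curve at a few frequencies (illustrative
only; a full psychoacoustic model combines this curve with masking spread functions across
critical bands):

    import numpy as np

    def absolute_threshold_db_spl(f_hz):
        # Terhardt's approximation of the threshold in quiet, in dB SPL.
        # Spectral components below this curve are inaudible even with
        # no masker present.
        f = np.asarray(f_hz, dtype=float) / 1000.0  # kHz
        return (3.64 * f ** -0.8
                - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
                + 1e-3 * f ** 4)

    for f in (50, 1000, 4000, 16000):
        print(f"{f:5d} Hz: {absolute_threshold_db_spl(f):6.1f} dB SPL")

The dip near 3-4 kHz reflects the ear's region of greatest sensitivity; components far below
the curve at the spectral extremes can be discarded outright.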
Time domain methods
Other types of lossy compressors, such as the linear predictive coding (LPC) used with
speech, are source-based coders. These coders use a model of the sound's generator
(such as the human vocal tract with LPC) to whiten the audio signal (i.e., flatten its spectrum)
prior to quantization. LPC may also be thought of as a basic perceptual coding technique;
reconstruction of an audio signal using a linear predictor shapes the coder's quantization
noise into the spectrum of the target signal, partially masking it.
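A minimal sketch of the analysis side, using the textbook autocorrelation method with the
Levinson-Durbin recursion (illustrative only; production speech coders add windowing,
pre-emphasis, and careful quantization of the coefficients):

    import numpy as np

    def lpc(x, order):
        # Autocorrelation method + Levinson-Durbin recursion. Returns the
        # prediction-error filter A(z) = 1 + a1*z^-1 + ...; filtering the
        # signal through A(z) yields the whitened residual the coder
        # actually quantizes.
        r = np.array([x[:len(x) - k] @ x[k:] for k in range(order + 1)])
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
            a[1:i] += k * a[i - 1:0:-1]
            a[i] = k
            err *= 1.0 - k * k
        return a

    # The residual of a predictable (voiced-like) signal carries far less
    # energy than the signal itself - that gap is the bit savings.
    n = np.arange(2048)
    x = (np.sin(2 * np.pi * 200 * n / 8000)
         + 0.01 * np.random.default_rng(0).standard_normal(2048))
    residual = np.convolve(x, lpc(x, order=10))[:len(x)]
    print("residual/signal energy:", np.sum(residual ** 2) / np.sum(x ** 2))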
Applications
Due to the nature of lossy algorithms, audio quality suffers when a file is decompressed
and recompressed (digital generation loss). This makes lossy compression unsuitable for
storing the intermediate results in professional audio engineering applications, such as
sound editing and multitrack recording. However, lossy formats are very popular with end users
(particularly MP3), as a megabyte can store about a minute's worth of music at adequate quality.
Usability
Usability of lossy audio codecs is determined by:
Perceived audio quality
Compression factor
Speed of compression and decompression
Inherent latency of algorithm (critical for real-time streaming applications; see below)
Product support
Lossy formats are often used for the distribution of streaming audio, or interactive
applications (such as the coding of speech for digital transmission in cell phone networks).
In such applications, the data must be decompressed as the data flows, rather than after the
entire data stream has been transmitted. Not all audio codecs can be used for streaming
applications, and for such applications a codec designed to stream data effectively will
usually be chosen.
Latency results from the methods used to encode and decode the data. Some codecs will
analyze a longer segment of the data to optimize efficiency, and then code it in a manner
that requires a larger segment of data at one time in order to decode. (Codecs often split the
data into segments called "frames", which are encoded and decoded as discrete units.)
The inherent latency of the coding algorithm can be critical; for example, when there is
two-way transmission of data, such as with a telephone conversation, significant delays may
seriously degrade the perceived quality.
In contrast to the speed of compression, which is proportional to the number of operations
required by the algorithm, here latency refers to the number of samples which must be analysed
before a block of audio is processed. In the minimum case, latency is zero samples (e.g., if
the coder/decoder simply reduces the number of bits used to quantize the signal). Time domain
algorithms such as LPC also often have low latencies, hence their popularity in speech coding
for telephony. In algorithms such as MP3, however, a large number of samples have to be analyzed
in order to implement a psychoacoustic model in the frequency domain, and latency is on the order
of 23 ms (46 ms for two-way communication).
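A back-of-the-envelope check of those figures (frame sizes here are illustrative; actual
encoder delay also includes look-ahead and filterbank overlap, and depends on the sample rate):

    # Minimum one-way delay of a frame-based codec: the encoder must
    # buffer a full analysis frame before it can emit anything.
    for name, frame_samples, sample_rate in [
        ("MP3-style frame (1152 samples @ 48 kHz)", 1152, 48000),
        ("telephony speech frame (160 samples @ 8 kHz)", 160, 8000),
    ]:
        one_way_ms = 1000 * frame_samples / sample_rate
        print(f"{name}: one-way >= {one_way_ms:.0f} ms, "
              f"two-way >= {2 * one_way_ms:.0f} ms")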
Speech encoding
Speech encoding is an important category of audio data compression. The perceptual
models used to estimate what a human ear can hear are generally somewhat different from
those used for music. The range of frequencies needed to convey the sounds of a human
voice is normally far narrower than that needed for music, and the sound is normally
less complex. As a result, speech can be encoded at high quality using relatively low bit rates.
This is accomplished, in general, by some combination of two approaches:
Only encoding sounds that could be made by a single human voice.
Throwing away more of the data in the signal—keeping just enough to reconstruct an
"intelligible" voice rather than the full frequency range of human hearing.
Perhaps the earliest algorithms used in speech encoding (and audio data compression in general)
were the A-law algorithm and the µ-law algorithm.
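The µ-law curve itself is simple. The sketch below uses the continuous form (G.711 hardware
actually implements a piecewise-linear approximation of it, but the principle is the same:
fine quantization steps for quiet samples, coarse ones for loud samples):

    import numpy as np

    MU = 255.0  # mu-law parameter used in North American/Japanese telephony

    def mu_law_encode(x):
        # Compress samples in [-1, 1] so that an 8-bit code keeps far
        # more relative precision for quiet samples.
        return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

    def mu_law_decode(y):
        return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

    for amp in (0.01, 0.1, 1.0):
        coded = np.round(mu_law_encode(amp) * 127) / 127  # 8-bit grid
        print(f"in={amp:5.2f}  out={mu_law_decode(coded):.4f}")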
[Image: Solidyne 922, the world's first commercial audio bit compression card for PC, 1990]
History
A literature compendium for a large variety of audio coding systems was published in the
IEEE Journal on Selected Areas in Communications (JSAC), February 1988. While there were
some papers from before that time, this collection documented an entire variety of finished,
working audio coders, nearly all of them using perceptual (i.e. masking) techniques and some
kind of frequency analysis and back-end noiseless coding.[1] Several of these papers remarked
on the difficulty of obtaining good, clean digital audio for research purposes. Most, if not
all, of the authors in the JSAC edition were also active in the MPEG-1 Audio committee.
The world's first commercial broadcast automation audio compression system was developed by
Oscar Bonello, an engineering professor at the University of Buenos Aires. In 1983, using
the psychoacoustic principle of the masking of critical bands first published in 1967, he
started developing a practical application based on the recently developed IBM PC computer,
and the broadcast automation system was launched in 1987 under the name Audicom. Twenty years
later, almost all the radio stations in the world were using similar technology, manufactured
by a number of companies.
The article is based on materials from wikipedia.org.