Despite critics' claims, acceptance of voice-over-IP (VoIP) technology continues to gain ground in the enterprise and core network. With standards like PacketCable and 3GPP emerging, VoIP is also set to gain ground in the access markets.
But to ultimately succeed, designers must ensure that VoIP systems can deliver the quality of speech that end users demand. And to reach that plateau, designers must grapple with a host of impairments, including packet loss, delay, jitter, echo, noise and more.
This two-part series lays out the challenges designers must tackle to improve voice quality in a VoIP system design. The first part lays out the key attributes required of speech coder technology, as well as the headaches that the packet network creates for speech-coding architectures. The second part, which will appear online tomorrow, will examine how designers can account for speech coding problems through echo cancellation and more.
Coder Attributes
When designing a VoIP system, the choice of a speech coder is a function of a number of network factors, such as the expected delay and the available processing power, as well as the user's service-quality requirements and speech-quality expectations. The attributes of a speech coder include bit rate, complexity, delay, and quality.4 Let's look at these four dimensions in more detail.
1. Bit rate and required bandwidth: The bit rates of the coders defined by the ITU range from the low 2.4 kbit/s coders used in secure telephony to 64 kbit/s coders, such as the wideband G.722 or the G.711 pulse-code modulation (PCM) coder. The rate of the coder determines the required channel bandwidth. In cellular telephony, for instance, conserving bandwidth is crucial. As such, variable bit rate coders, such as the enhanced variable rate coder (EVRC) used in 2G CDMA systems, were designed to drop the coding rate during speech inactivity.
2. Delay: The delay of the coder is relevant to the extent that it adds to the overall end-to-end delay in a VoIP call. The total delay of a coder includes the framing delay as well as the algorithmic or look-ahead delay. In G.728, for instance, frames are five samples long, whereas in cellular-telephony coders, frame sizes of 160 samples (typically 20 ms) are more common. High-rate coders, such as G.711 and G.726, have a very low delay.
3. Quality: The quality of speech is a subjective measure that reflects the way the signal is perceived by listeners. It can be expressed in terms of how much effort is required to understand the message, or how pleasant or comfortable the speech sounds to the human ear. Intelligibility, on the other hand, is an objective measure of the amount of information that listeners can extract from the given signal.21 In military contexts, intelligibility is of critical importance, whereas in consumer telephony, quality takes precedence. The quality of speech coders is often measured through a mean opinion score (MOS) experiment. Quality degradation is also tested under bit errors, frame erasures and background noise, all of which may cause the coder to generate various unpleasant artifacts.
4. Complexity: Speech coding algorithms are in general computationally intensive. As a result, they are typically implemented on programmable digital signal processors (DSPs) that are optimized for signal processing operations, such as convolutions, fast Fourier transforms (FFTs), and digital filtering. PC-based processors have in recent years evolved to provide enough processing power to make them appropriate candidates to run complex operations such as speech coding. As VLSI technology enables more MIPS per silicon area at a decreasing cost, the complexity aspect is less crucial than it used to be. However, it is always desirable to pack as much functionality as possible into a processor, and to have efficient algorithms that do not use up a large percentage of the available processing power.
The commonly used coders such as G.723, G.729, and G.728 were developed with specific requirements and priorities in mind; as such, they provide different levels of compromises along these four dimensions. A summary of the performance level provided by each of these coders is highlighted in References 4, 5, and 6 as well as in Table 1.
Table 1: Summary of Attributes for 3 Commonly Used VoIP Coders
| Attribute | G.723.1 | G.729 | G.729a |
|---|---|---|---|
| Bit rate | 6.4 kbit/s, 5.33 kbit/s | 8 kbit/s | 8 kbit/s |
| Frame size | 30 ms | 10 ms | 10 ms |
| Look ahead | 7.5 ms | 5 ms | 5 ms |
| Total delay | 67.5 ms | 25 ms | 25 ms |
| Complexity | 16 MIPS | 20 MIPS | 10 MIPS |
| RAM | 2.2 kwords | 3 kwords | 2 kwords |
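The total-delay figures in Table 1 match a simple rule: two frame times plus the look-ahead. The Python sketch below checks that arithmetic. Reading the two frame times as one to collect the samples and one to process them is an assumption made here for illustration, and the function name is hypothetical.

```python
def total_delay_ms(frame_ms, lookahead_ms):
    """Coder delay under the assumed rule: 2 x frame size + look-ahead."""
    return 2 * frame_ms + lookahead_ms


# (frame size, look-ahead) in milliseconds, taken from Table 1.
CODERS = {
    "G.723.1": (30.0, 7.5),
    "G.729": (10.0, 5.0),
    "G.729a": (10.0, 5.0),
}

for name, (frame_ms, lookahead_ms) in CODERS.items():
    print(f"{name}: {total_delay_ms(frame_ms, lookahead_ms)} ms")
```

Under this rule the 30 ms frame of G.723.1 dominates its delay, which is why it comes out at more than twice the figure for the G.729 family.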
Quality Issues in Packet Networks
To ensure high speech quality in an IP network, designers must realize that many of the challenges lie in the inherent nature of the network. Designers will encounter six big issues in the network: packet loss/bit-error rates, delay, jitter, echo, background noise, and tandeming effects. Let's look at each of these in more detail.
1. Packet Loss and Bit-Error Rates
In an end-to-end VoIP network, packets are lost because of excessive bit errors, congestion in the IP network, or simply an excessive delay that causes the receiver to ignore the corresponding speech frames in the decoding operation. The first cause lies in the access network itself, which includes a noisy channel such as a wireless link, a cable or DSL line, or a voice-band modem. In each channel, a certain amount of error detection and correction is designed in at the physical layer (PHY) to guarantee an upper limit on the bit error rate (BER). A packet is declared corrupted whenever it contains bit errors that the forward error correction (FEC) mechanism could not correct.
The second cause of loss is the IP network itself, which is operated on a best-effort basis. During peak traffic times, queues at intermediate routers may overflow and packets are simply dropped. Analyses of the loss statistics suggest that packet loss is highly bursty and that the frequency distribution of the number of consecutive losses decreases geometrically.2, 18 For this reason, most recovery techniques are optimized for a maximum of one or two consecutive lost packets.
Finally, packets are dropped (or ignored) at the receiver because they arrive too late. In this case, it is better to ignore the packet and reconstruct the parameters than to extend the delay in speech reconstruction. In general, voice traffic can tolerate some packet loss, depending on the coding algorithm, but a loss rate greater than 5 percent is considered harmful to voice quality and will result in a drop below toll quality for most coders.3
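To make the loss figures above concrete, the following Python sketch measures the overall loss rate and the distribution of consecutive-loss run lengths in a packet trace. The trace is made up for illustration, and `loss_stats` is a hypothetical helper, not part of any VoIP stack.

```python
from collections import Counter


def loss_stats(lost):
    """Return (loss rate, {burst length: count}) for a packet trace.

    `lost` is a sequence of booleans, True where a packet was lost.
    """
    bursts = Counter()
    run = 0
    for is_lost in lost:
        if is_lost:
            run += 1
        elif run:
            bursts[run] += 1  # a burst just ended
            run = 0
    if run:
        bursts[run] += 1  # trace ended in the middle of a burst
    rate = sum(lost) / len(lost) if lost else 0.0
    return rate, dict(bursts)


# Hypothetical 40-packet trace: one single loss and one 2-packet burst.
trace = [False] * 10 + [True] + [False] * 12 + [True, True] + [False] * 15
rate, bursts = loss_stats(trace)
print(f"loss rate: {rate:.1%}, bursts: {bursts}")
print("above 5 percent threshold" if rate > 0.05 else "within 5 percent threshold")
```

On a real trace, the burst histogram shows whether losses are concentrated in runs of one or two packets, the case most recovery techniques are optimized for.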
2. Delay
Long delays in speech communications cause echo and talker-overlap problems. Echo originates in the telephone hybrid circuit at the far end, so the near-end talker hears a reflected version of their own voice. This reflection becomes annoying when the delay is greater than 50 ms. Talker overlap becomes significant if the one-way delay is greater than 250 ms, as the exchange becomes more like push-to-talk than a normal conversation.
Delay in a VoIP system comes from a number of sources:
- Framing delay, defined as the time to collect and frame the samples. The value is a function of the coder used (e.g. 10 ms for G.729a; 30 ms for G.723).
- Algorithmic delay, defined as the look-ahead delay required for some speech coders or some acoustic echo cancellers.
- Processing delay, which is a function of the user equipment, such as the processor speed and the efficiency of the coder implementation. It also includes other higher-layer functions such as the concatenation of several speech frames into a single packet to reduce overhead.
- Network delay, which includes the various routing and buffering in the IP network, and scheduling and buffering at the receiver end to remove packet jitter.
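The trade-off behind concatenating several speech frames into one packet can be sketched numerically. The Python example below assumes a 40-byte header per packet (IPv4 20 bytes, UDP 8 bytes, RTP 12 bytes, no header compression and no link-layer overhead). That transport stack is an assumption for illustration, since the article does not name one, and `packet_profile` is a hypothetical helper.

```python
# Assumed per-packet header: IPv4 (20) + UDP (8) + RTP (12) bytes.
HEADER_BYTES = 40


def packet_profile(bit_rate_bps, frame_ms, frames_per_packet):
    """Return (payload bytes, packet interval in ms, on-the-wire bit rate)."""
    payload_bytes = bit_rate_bps * frame_ms / 1000 / 8 * frames_per_packet
    packet_interval_ms = frame_ms * frames_per_packet
    wire_bps = (payload_bytes + HEADER_BYTES) * 8 * 1000 / packet_interval_ms
    return payload_bytes, packet_interval_ms, wire_bps


# G.729a from Table 1: 8 kbit/s, 10 ms frames.
for n in (1, 2, 3):
    payload, interval, wire = packet_profile(8000, 10, n)
    print(f"{n} frame(s)/packet: payload {payload:.0f} B, "
          f"one packet every {interval} ms, {wire / 1000:.1f} kbit/s on the wire")
```

Under these assumptions, packing more frames per packet lowers the bit rate on the wire, but every extra frame adds one frame time of delay before the packet can be sent.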
To illustrate the impact of delay, consider the case of a dial-up VoIP call originating from a user's PC and using a G.723 coder. The minimum values for the end-to-end delay components are given in Table 2 below:1
Table 2: Various Delay Components in a VoIP Call
| Component | Theoretical Delay (ms) |
|---|---|
| PC client | 67.5 |
| Access | 44 |
| IP network | 40 |
| Gateway | 67.5 |
| PSTN/phone | Negligible |
| Total | 159 |
3. Jitter
Jitter is the variation in delay between consecutive packets. It arises from differences in delay along different routes through the IP network. Even if intermediate routers give priority to voice traffic, there is no guarantee that consecutive packets arrive in order at the destination.
A typical remedy for jitter is to provide buffering at the destination to wait for late arriving packets and then re-sequence the speech frames for proper decoding. However, there is a limit on the amount of buffering that is practical.
Large jitter will result in more packets being dropped (i.e. lost), and this will impact quality. In some applications, the jitter buffer length is dynamically updated (Figure 1) to get an acceptable ratio of late arrivals to successfully processed frames.16 This, however, results in a changing average delay (due to buffering) and in turn requires that echo cancellation algorithms be capable of fast adaptation in their estimate of the round-trip delay, as it changes dynamically during the course of a conversation.
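One simple way to picture the dynamic update is to choose, over a recent window of packets, the smallest playout delay that keeps the fraction of late packets at or below a target. The Python sketch below does that. It is an illustrative policy and not the algorithm of Figure 1 or Reference 16, and the delay values and function name are hypothetical.

```python
import math


def choose_buffer_ms(delays_ms, target_late_ratio):
    """Smallest playout delay whose late-packet fraction is <= the target.

    A packet counts as late when its delay exceeds the playout delay.
    """
    if not delays_ms:
        raise ValueError("need at least one delay measurement")
    ordered = sorted(delays_ms)
    # Number of slowest packets we are willing to drop as late arrivals.
    allowed_late = min(math.floor(target_late_ratio * len(ordered)),
                       len(ordered) - 1)
    return ordered[len(ordered) - 1 - allowed_late]


# Hypothetical one-way delays (ms) for the last ten packets.
recent = [40, 42, 41, 45, 43, 90, 44, 41, 42, 46]
print(choose_buffer_ms(recent, 0.10))  # tolerate 10 percent late packets
print(choose_buffer_ms(recent, 0.0))   # tolerate none
```

Re-running the selection on each new window lets the buffer track network conditions. In this made-up trace, tolerating one late packet in ten roughly halves the playout delay. Each change of depth also shifts the end-to-end delay, which is the effect the echo canceller has to follow.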
4. Echo
Both line echo cancellers and acoustic echo cancellers are needed to eliminate the echo so that the perceived quality is not impaired. The ITU-T recommendations G.165 and G.168 specify the characteristics of echo cancellers, in terms of the required length of the delay to cancel as well as the targeted echo attenuation.
5. Background Noise
In mobile telephony and conference call settings, surrounding acoustic noise often corrupts speech signals. This in turn has an adverse effect on the perceived quality and intelligibility of speech, as well as on the performance of speech coders. These coders rely on a model of the clean signal and cannot properly handle background noise such as engine, wind, traffic, music or the aggregate effect of many interfering speakers. As a result of the coding process, the effect of background noise is often amplified and produces unnatural and annoying sounds for the far-end user (Figure 2). The problem is more severe for low-rate coders, and more so for CELP-based coders than for waveform coders such as PCM or adaptive differential PCM (ADPCM).