Measuring VoIP Quality with SIP and RTP

With Catchpoint's new custom monitor, we can now capture packet loss, jitter, and RTT to measure the quality of an audio session over SIP.

The pandemic has changed the way teams collaborate within an organization and between companies. With work from home becoming the new normal, employees are turning to new options for collaboration: meetings, training, and onboarding have moved online. The office is now a virtual space. With the increasing demand for online meetings, it is even more important to monitor the health and performance of such meetings.

Voice over IP (VoIP) technology is responsible for delivering voice and multimedia sessions over the internet. There have been significant advances in this field that have improved audio quality over the internet. Even with these advances, packet loss remains a major cause of performance degradation. With Catchpoint’s new custom monitor, we can now capture packet loss, jitter, and Round Trip Time (RTT) to measure the quality of an audio session over SIP. These metrics are then used to calculate the Mean Opinion Score (MOS) for a VoIP SIP call.

In this blog, we discuss the technology, protocols and metrics that the Catchpoint custom monitor uses to measure audio quality.

Voice over IP (VoIP)

Most organizations are switching from traditional phone systems to VoIP, which has several advantages over traditional phone systems. First, the call rates are lower, and the rates do not increase when calling long distance. Second, it allows us to integrate different media types, such as documents, images, and video, along with the audio.

The media transported over the internet is encoded with one of several available audio and video codecs. An audio codec converts the speech signal into a digitally encoded signal. A video codec breaks the video into small chunks, which are then compressed using an algorithm. Each codec has its own advantages and can be selected based on the use case: some implementations rely on narrowband, compressed speech, while others support high-fidelity stereo codecs. In recent years, new hardware and algorithms have been designed to handle packet loss and improve voice and video quality. The VoIP data is transferred over the Real-time Transport Protocol (RTP).

Real-time Transport Protocol (RTP)

RTP was designed to deliver real-time media over an IP network, and it typically runs on top of UDP. RTP packets are used whenever media is transferred over the internet. The advantage RTP packets have over plain UDP datagrams is that each one carries a sequence number and a timestamp. The sequence number allows the receiver to put the packets in order, and the timestamp records when each packet's media was generated. Together, they help rearrange packets that arrive out of order at the destination and identify any missing packets.
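To make the role of the sequence number and timestamp concrete, here is a minimal sketch in Python that parses the 12-byte fixed RTP header defined in RFC 3550 and uses the sequence numbers to reorder packets and spot gaps. The names `RtpHeader`, `parse_rtp_header`, and `missing_sequences` are hypothetical helpers for illustration and are not part of the Catchpoint monitor; the gap check also ignores 16-bit sequence wraparound for brevity.

```python
import struct
from typing import List, NamedTuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    sequence: int   # 16-bit counter, incremented by one per packet
    timestamp: int  # 32-bit sampling instant of the first media sample
    ssrc: int       # identifier of the media source


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header (RFC 3550, section 5.1)."""
    if len(packet) < 12:
        raise ValueError("RTP packet is shorter than the 12-byte fixed header")
    # Network byte order: two flag bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


def missing_sequences(received: List[int]) -> List[int]:
    """Return sequence numbers absent between the lowest and highest received.

    Simplification: does not handle 16-bit sequence number wraparound.
    """
    if not received:
        return []
    seen = set(received)
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]
```

Sorting parsed headers by `sequence` restores the sending order, and `missing_sequences([3, 1, 5, 2])` reports that packet 4 never arrived.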

Even with all these mechanisms in place, RTP data transmission suffers from packet loss and jitter. The Catchpoint custom monitor captures performance metrics during RTP transmission. These metrics can then be used to alert on performance degradation. The metrics captured are:

  • Round Trip Time: Time taken from source to destination and back. Reported in milliseconds.
  • Packet loss: Percentage of total packet loss during the audio session.
  • Jitter TX/RX: Jitter is the variation in delay between packets relative to when they were expected to be delivered. Reported in milliseconds.
  • Mean Opinion Score (MOS) TX/RX: MOS is a score given based on the quality of the audio session. This score ranges from 1 to 5, with 1 being the worst and 5 the best (Fig 1).
Fig 1. Mean Opinion Score quality scale.
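As an illustration of how packet loss and jitter can be derived from a stream of packets, the sketch below computes a loss percentage and a smoothed interarrival jitter. It assumes the estimator described in RFC 3550 (a running average with a gain of 1/16); this post does not state which jitter formula the Catchpoint monitor uses, and the function names are hypothetical. For simplicity, send and arrival times are given in milliseconds rather than RTP timestamp units.

```python
from typing import Iterable, Tuple


def packet_loss_percent(expected: int, received: int) -> float:
    """Percentage of expected packets that never arrived."""
    if expected <= 0:
        return 0.0
    lost = max(expected - received, 0)
    return 100.0 * lost / expected


def interarrival_jitter_ms(packets: Iterable[Tuple[float, float]]) -> float:
    """Smoothed interarrival jitter, following the RFC 3550 estimator.

    `packets` yields (send_ms, arrival_ms) pairs in arrival order.
    """
    jitter = 0.0
    prev_transit = None
    for send_ms, arrival_ms in packets:
        transit = arrival_ms - send_ms
        if prev_transit is not None:
            # Difference in transit time between consecutive packets.
            d = abs(transit - prev_transit)
            # Move the running estimate 1/16 of the way toward the new sample.
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter
```

For example, if 200 packets were expected and 190 arrived, `packet_loss_percent(200, 190)` gives 5.0. A stream whose packets all take the same time to arrive has zero jitter, no matter how long that transit time is.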

When scoring an audio session, the maximum score of 5 is never awarded. The MOS is calculated from three of these metrics: Round Trip Time, jitter, and packet loss. The image below illustrates the formula and calculation.

Fig 2. Formula to calculate Mean Opinion Score.
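For readers who want to experiment with the idea, here is a rough sketch of one widely used approximation. It is an assumption on our part and not necessarily the formula shown in Fig 2: latency, jitter, and loss are folded into an R-factor, which is then mapped to MOS using the conversion from ITU-T G.107. The function name `estimate_mos`, the use of RTT / 2 as the one-way delay, and the penalty constants are all illustrative.

```python
def estimate_mos(rtt_ms: float, jitter_ms: float, loss_percent: float) -> float:
    """Estimate MOS from RTT, jitter, and packet loss (simplified E-model).

    Assumptions: one-way delay is approximated as RTT / 2, jitter is
    weighted double, and 10 ms is added for codec processing delay.
    """
    effective_latency = rtt_ms / 2.0 + 2.0 * jitter_ms + 10.0

    # Latency penalty: gentle below 160 ms, steeper above it.
    if effective_latency < 160.0:
        r = 93.2 - effective_latency / 40.0
    else:
        r = 93.2 - (effective_latency - 120.0) / 10.0

    # Each percent of packet loss costs 2.5 R-factor points.
    r -= 2.5 * loss_percent
    r = max(0.0, min(100.0, r))

    # R-factor to MOS conversion from ITU-T G.107.
    mos = 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)
    return max(1.0, mos)
```

Under these assumptions, even a perfect network (zero RTT, jitter, and loss) yields a score of about 4.4, which is consistent with the observation that the maximum score of 5 is never reached.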

The data sent over the IP network is controlled by signaling protocols. A signaling protocol establishes, maintains, and terminates a call. There are multiple signaling protocols to choose from; in this case, we rely on the Session Initiation Protocol (SIP).

Session Initiation Protocol (SIP)

SIP is an application-layer control protocol that can establish, modify, and terminate multimedia sessions (conferences) such as Internet telephony calls. SIP can also invite participants to already existing sessions, such as multicast conferences. Media can be added to (and removed from) an existing session. Read more about SIP in RFC 3261 – SIP: Session Initiation Protocol.

Fig 3. SIP Session setup example with SIP trapezoid.

To capture the complete picture of SIP performance, we have added four metrics. These metrics capture how much time was spent at different stages of a SIP session and help answer important questions, such as how long it took to connect or disconnect.

The four metrics captured for SIP are listed below; refer to Fig 3 to understand each.

  • Session Request Delay (SRD): This metric is measured from the moment the client sends the SIP INVITE request until it receives the 180 Ringing response. It indicates the time taken to ring a phone.
  • Session Negotiation Time (SNT): SNT is measured on the client side, between sending the SIP INVITE request and receiving 200 OK. It covers the time taken to ring the phone plus the time taken to answer the call.
  • Session Duration Time (SDT): This metric marks the total time spent on the call, during which the RTP data was exchanged.
  • Session Disconnect Delay (SDD): The last metric records the time taken to disconnect a call. It is measured between sending the SIP BYE request and receiving 200 OK (BYE).
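The four SIP timings above can be sketched as simple differences between event timestamps recorded on the client. The `SipCallTimeline` class and `sip_metrics` function below are hypothetical and not taken from the Catchpoint codebase. Measuring the session duration from the 200 OK to the BYE is also an assumption, chosen because that is the window in which RTP data is exchanged.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SipCallTimeline:
    """Client-side timestamps (in seconds) for the key events of one call."""
    invite_sent: float
    ringing_received: float   # 180 Ringing
    ok_received: float        # 200 OK for the INVITE
    bye_sent: float
    bye_ok_received: float    # 200 OK for the BYE


def sip_metrics(t: SipCallTimeline) -> Dict[str, float]:
    """Derive the four SIP timing metrics from a call timeline."""
    return {
        # INVITE -> 180 Ringing: time taken to ring the phone.
        "session_request_delay": t.ringing_received - t.invite_sent,
        # INVITE -> 200 OK: time to ring plus time to answer.
        "session_negotiation_time": t.ok_received - t.invite_sent,
        # 200 OK -> BYE: assumed window in which RTP media flows.
        "session_duration_time": t.bye_sent - t.ok_received,
        # BYE -> 200 OK (BYE): time taken to tear the call down.
        "session_disconnect_delay": t.bye_ok_received - t.bye_sent,
    }
```

For a call where the phone rings after 0.5 s, is answered at 2 s, and is hung up at 32 s with the BYE acknowledged 0.25 s later, this yields a request delay of 0.5 s, a negotiation time of 2 s, a duration of 30 s, and a disconnect delay of 0.25 s.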

The custom monitor relies on the SIP Simple Client SDK; find more details here. Performance metrics from a sample audio session are charted in the dashboard below (Fig 4).

Fig 4. Catchpoint dashboard for SIP and VoIP performance.

With a total of eight metrics covering the performance of SIP and RTP, the custom monitor provides a comprehensive picture of a voice call. It helps you understand and analyze any performance degradation. The metrics can also be used to trigger alerts when certain thresholds are crossed.

To help you get started with custom monitors, detailed instructions and codebase are available on GitHub.
