The Need for ‘Real ISP’ Visibility in your DEM Strategy

Published

July 2, 2020

Recently, one of my buddies moved to a new apartment. As an extreme binge-watcher, he tried every possible arrangement of his 65-inch 4K TV and leather recliner, covering almost every square foot of the living room.

The one thing he could not choose was the Internet Service Provider (ISP) for his new apartment. As a backend developer, he did not mind, as long as he was getting a 150 Mbps internet connection.

Having worked in the web performance space for the last 7 years, I have seen firsthand how big a part ISPs play in the end-user experience. Had I been in his position, I would have asked myself:

Would my binge-watching experience remain the same?
How about my favorite news and media sites? Will browsing through them every morning be any different?
What about SaaS applications such as O365, Salesforce and Zoom? Will there be any impact on them during business hours?

No single ISP is clearly better than the rest. What matters is how each ISP peers and interacts with the different components of the application delivery chain: the backbone network, DNS, CDNs, hosting providers, the cloud and third-party tags (and their own DNS, CDN, cloud, etc.). Those interactions affect the three critical components of an end user’s digital experience: reachability, performance and reliability.

Let’s explore a few examples that illustrate how ISPs play a vital role in the overall digital experience of end users –

Example 1 – Reachability

Let’s add some clarity to the term “Reachability”. How is it different from availability?

A service can be up and running all the time in a data center or cloud, that is, 100% available, and still not be reachable from some parts of the world or from some networks.

Reachability is availability from the end user’s perspective. What good is a system with 100% availability if its intended users cannot reach it?
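To make the distinction concrete, here is a minimal sketch of how reachability could be computed per ISP from probe results. The ISP names and the probe data are hypothetical placeholders, not measurements from this post, and `reachability_by_isp` is just an illustrative helper.

```python
from collections import defaultdict

# Hypothetical probe results: (isp, service_was_reached). Illustrative only.
probes = [
    ("ISP A", True), ("ISP A", True), ("ISP A", True), ("ISP A", False),
    ("ISP B", True), ("ISP B", False), ("ISP B", False), ("ISP B", False),
]

def reachability_by_isp(results):
    """Return {isp: percentage of probes that reached the service}."""
    reached = defaultdict(int)
    total = defaultdict(int)
    for isp, ok in results:
        total[isp] += 1
        if ok:
            reached[isp] += 1
    return {isp: 100.0 * reached[isp] / total[isp] for isp in total}

# The service may report 100% uptime in its data center;
# this is the same service seen from the user's side of the network.
for isp, pct in sorted(reachability_by_isp(probes).items()):
    print(f"{isp}: {pct:.0f}% reachable")
```

The point of splitting by ISP is that a single global percentage can hide one network where most users cannot get through.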

In the scatterplot chart below, we plot the reachability data of an organization that has launched its services in China.

Blue dots – successful reach.
Red diamonds – unable to reach.

Based on the scatterplot, China Mobile has the lowest reachability to the new service that the organization is trying to launch.

Does the organization need to invest time in making its service more reachable in China? Yes. China Mobile is the leading wireless provider in China, which means the majority of consumers in China would face issues while trying to access the newly launched service. To be successful in China and to take on local or well-established competitors, reachability is the starting point for improvement, followed by performance optimization.

Example 2 – Performance Impact

Performance not only varies across different geo-locations but also across different ISPs in the same region.

The chart above shows how the performance of a static HTTP asset varies across different ISPs in Miami, FL.

To identify the cause of the ISP-based variation, we broke down the data by CDN PoP location (captured via CDN response headers).

The breakdown shows –

Miami AT&T users are mapped to the CDN instance in Jacksonville, FL, and the HTTP response time went up to 4.5s.
Miami Verizon and CenturyLink users are mapped to the CDN instance in Miami itself, and the HTTP response time is always less than 0.5s.
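A breakdown like this one can be sketched as a simple group-by on ISP and PoP. This is only an illustration: the header name `x-cdn-pop` is a made-up placeholder (each CDN exposes its PoP differently, if at all), and the ISP labels and timings below are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Assumption: the CDN reports the serving PoP in a response header.
# "x-cdn-pop" is a placeholder name, not a real CDN header.
POP_HEADER = "x-cdn-pop"

# Hypothetical samples: (isp, response headers, response time in seconds).
samples = [
    ("ISP A", {"X-CDN-PoP": "JAX"}, 4.1),
    ("ISP A", {"X-CDN-PoP": "JAX"}, 3.2),
    ("ISP A", {"X-CDN-PoP": "JAX"}, 4.5),
    ("ISP B", {"X-CDN-PoP": "MIA"}, 0.3),
    ("ISP B", {"X-CDN-PoP": "MIA"}, 0.4),
    ("ISP C", {}, 0.2),  # header missing on this response
]

def median_by_isp_and_pop(rows, header=POP_HEADER):
    """Group response times by (isp, pop) and return the median of each group."""
    groups = defaultdict(list)
    for isp, headers, seconds in rows:
        # HTTP header names are case-insensitive, so normalise before lookup.
        lowered = {name.lower(): value for name, value in headers.items()}
        pop = lowered.get(header, "unknown")
        groups[(isp, pop)].append(seconds)
    return {key: median(values) for key, values in groups.items()}

for (isp, pop), seconds in sorted(median_by_isp_and_pop(samples).items()):
    print(f"{isp} via {pop}: median {seconds:.2f}s")
```

Keeping the PoP in the grouping key is what exposes the mapping problem: the same ISP label alone would only tell you that something is slow, not where the requests are landing.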

Miami and Jacksonville are only about 350 miles apart, so distance alone should not cause 6x-8x worse performance. So what is causing it?

The network RTT (chart below) shows that the issue is at the network layer: the RTT between Miami AT&T and the CDN PoP is high and volatile.

Examining the network path from Miami AT&T to the CDN PoP, we see that the final hop, which is the link to the CDN PoP in Jacksonville, has a network latency of 275ms, which in turn inflates the response times of user requests.
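To put that 275ms in context, here is a rough back-of-the-envelope estimate of the best-case round trip over 350 miles. It assumes light travels through optical fiber at roughly two thirds of its speed in a vacuum (about 200,000 km/s) and that the cable follows the straight-line distance, which real routes never quite do. Treating the 275ms as a round-trip figure is also my assumption.

```python
MILES_TO_KM = 1.609344

# Assumption: signal speed in optical fiber is about 2/3 of c, ~200,000 km/s.
# Real fiber paths are longer than the straight line, so this is a lower bound.
FIBER_KM_PER_S = 200_000

def min_rtt_ms(distance_miles):
    """Best-case round-trip time in milliseconds over the given distance."""
    km = distance_miles * MILES_TO_KM
    one_way_seconds = km / FIBER_KM_PER_S
    return 2 * one_way_seconds * 1000

ideal = min_rtt_ms(350)   # Miami to Jacksonville, per the post
observed = 275            # final-hop latency reported in the post, in ms

print(f"ideal RTT over 350 miles: {ideal:.1f} ms")
print(f"observed is roughly {observed / ideal:.0f}x the ideal")
```

Under these assumptions the ideal round trip comes out at under 6 ms, so distance explains only a small fraction of what was measured. The rest has to come from routing, congestion or queuing somewhere on the path.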

Is it important for organizations to know if there is a variation in the performance of their application across different ISPs?

Imagine Netflix takes a long time to load a video on your mobile phone. At that moment, would you blame your ISP, the underlying CDN, the DNS or the cloud provider that Netflix is using? Absolutely not. You would just blame Netflix.

End users do not care about the underlying mechanism; it is the experience that matters.

Example 3 – When IP Packets Decide to Go on a World Tour!

In this example, we analyze the HTTP ‘Wait’ time (TTFB) of a static asset across two key ISPs in Singapore – Singtel (Singapore Telecom) and Starhub. We can see a clear difference in the response time in the chart below.

The response time on Singtel is much better than on Starhub. But why is that happening?

Examining the network path between Starhub and the CDN PoP, we observe that the packets are routed away from Singapore to San Jose, USA, and we see a corresponding increase in the overall RTT at that hop (#8).
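Spotting a detour like this in traceroute-style output comes down to finding the hop where the RTT jumps the most. Below is a minimal sketch of that idea. The per-hop values are hypothetical and only shaped to mirror the pattern described above (a sharp rise at hop 8); they are not the actual measurements.

```python
# Hypothetical per-hop RTTs as (hop number, RTT in ms). Illustrative only.
hops = [
    (1, 1.1), (2, 2.4), (3, 3.0), (4, 3.8), (5, 4.1),
    (6, 5.0), (7, 5.6), (8, 181.3), (9, 182.0),
]

def largest_rtt_jump(path):
    """Return (hop_number, increase_ms) for the biggest rise over the previous hop.

    Returns (None, 0.0) if the RTT never increases or the path is too short.
    """
    best_hop, best_jump = None, 0.0
    for (_, prev_rtt), (hop, rtt) in zip(path, path[1:]):
        jump = rtt - prev_rtt
        if jump > best_jump:
            best_hop, best_jump = hop, jump
    return best_hop, best_jump

hop, jump = largest_rtt_jump(hops)
print(f"largest RTT increase: +{jump:.1f} ms at hop {hop}")
```

One caveat: a single slow hop can simply be a router that deprioritizes probe replies. The jump is more convincing when every later hop stays at the higher level, as it does in this made-up path.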

I would be delighted to go on a world tour. However, I would not want my browser/app requests to do the same and degrade the experience. A big NO.

Example 4 – Reliability

Reliability is delivering consistent application performance, availability and reachability. It is an ‘over a period of time’ measure that shows how application delivery is impacted by parameters such as time of day, day of the week, CDN/origin IP address, etc.

The chart below shows the performance of an application for China Telecom and China Unicom users over a period of 3 days.

We see that the performance of the app on China Unicom is far better and more consistent than on China Telecom.

Beyond that headline comparison, the ability to slice and dice the data across different parameters helps uncover interesting patterns (this way you do not have to be a data scientist to tell the story behind the data).

China Unicom – the performance of the application remains consistent over time, at around 4-5s.
China Telecom – app performance starts to worsen at around 9 AM CST (China Standard Time) and is at its worst at 8 PM CST. Performance improves from 3 AM until 9 AM CST.
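A time-of-day pattern like this one shows up once samples are bucketed by local hour. Here is a minimal sketch, assuming timestamps are recorded in UTC and converted to China Standard Time (a fixed UTC+8 offset with no daylight saving). The sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import median

# China Standard Time is a fixed UTC+8 offset with no daylight saving.
CHINA_TZ = timezone(timedelta(hours=8))

def median_by_local_hour(rows, tz=CHINA_TZ):
    """rows: iterable of (timezone-aware datetime, seconds).

    Returns {local hour: median response time}, ordered by hour.
    """
    buckets = defaultdict(list)
    for timestamp, seconds in rows:
        buckets[timestamp.astimezone(tz).hour].append(seconds)
    return {hour: median(values) for hour, values in sorted(buckets.items())}

# Hypothetical samples for one ISP, timestamps in UTC. Illustrative only.
samples = [
    (datetime(2020, 6, 1, 19, 0, tzinfo=timezone.utc), 4.0),    # 03:00 local
    (datetime(2020, 6, 1, 19, 30, tzinfo=timezone.utc), 5.0),   # 03:30 local
    (datetime(2020, 6, 2, 1, 0, tzinfo=timezone.utc), 9.0),     # 09:00 local
    (datetime(2020, 6, 2, 12, 0, tzinfo=timezone.utc), 15.0),   # 20:00 local
    (datetime(2020, 6, 2, 12, 15, tzinfo=timezone.utc), 17.0),  # 20:15 local
]

for hour, seconds in median_by_local_hour(samples).items():
    print(f"{hour:02d}:00 local: median {seconds:.1f}s")
```

Running the same bucketing separately for each ISP makes the contrast easy to see: a flat line for one, a daily wave for the other.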

The routine pattern of poor performance we see here can impact different businesses/industries:

Ecommerce – the peak shopping hour, i.e. between 8 PM and 9 PM, is hit by poor performance.
SaaS – employees and professionals get a poor experience during business hours, 8 AM to 5 PM.
Entertainment – people are generally less likely to watch a movie or episode between 4 AM and 8 AM, so there is little value in delivering the optimal experience during this window while it is poor for the rest of the day.
News, media, financial, etc. are some of the other industries impacted.

Conclusion

For organizations delivering digital services, if the end goal is to improve the end-user experience, start from the end users. Monitor from where the end users are, including the ISPs they use, to determine the impact of your investments in CDN, DNS, cloud, etc. on the overall end-user experience.

If monitoring is done only from public clouds such as AWS, Azure and GCP, you might miss critical issues that affect the reachability, reliability and performance of your services from the real user’s perspective.
