Webinar

Preventing Outages: Map Your Internet Stack

With the rise of the cloud, the Internet is now your new network – and your new network is incredibly complex. The Internet is much more than your ISP connection. It’s made up of many components that are constantly changing, creating a fragile Internet ecosystem known as the Internet Stack. And if they have an issue outside of the subset of applications and systems you can observe, you’re going to be blind to it.

Application Performance Monitoring (APM) can only see parts of the Internet Stack, which is why most organizations have limited visibility into what’s going wrong. However, Catchpoint’s Internet Performance Monitoring (IPM) solution maps the entire Internet Stack from the outside-in so you can track every step of your user’s journey across the digital service delivery chain.

Watch Mark Towler and Shree Shirgurkar detail how you can improve user experience, prevent outages, improve MTTR and reduce tickets by monitoring the entire Internet Stack.

Video Transcript

Mark Towler:

Thank you for joining us today. My name's Mark Towler. I'm the director of product marketing here today, and I'm joined by my friend and colleague, Shree Shirgurkar, who's the VP of Product Management here at Catchpoint. Today we're going to be talking to you about how you can prevent outages by mapping your Internet Stack.

So, why am I using the term preventing outages? Well, this webinar is one of a series. We not too long ago put together a white paper that's quite detailed and talks about preventing outages in 2023 and lessons that we can learn from recent failures. This webinar series is based on some of those lessons. We would strongly recommend that you check out this white paper. It is quite detailed, packed with some really useful information. You can find it on catchpoint.com or you can just simply scan that QR code there in the upper-right corner.

Watch the first webinar in this series – Preventing Outages: Monitor What Matters.

Today, obviously we're going to be talking about mapping your Internet Stack, but there are several other lessons learned and we've got a previous webinar that's out there as well as ones that are coming in the near future.

So, let's just talk for a second about the Internet and how fragile it is. Bear in mind that this was created almost as a science experiment way back when and was never intended to be the backbone of pretty much every part of modern life the way it is now. Some people have actually said it's pretty much held together with spit and baling wire. It is tremendously complex, and this image is actually a representation of the number of connections and API calls that a single e-commerce webpage would need just to display the homepage.

Obviously, I'm not going to go into detail on this, but there are dozens of them, and if any single one of them is slow, or fails or doesn't work, you get a bad experience or worse, the webpage doesn't even load. And of course, if you're a customer trying to access that webpage, that's a problem. So, the issue here is with this level of fragility, failure is almost inevitable at some point. So, the problem we've got is what do we need to pay attention to when we're looking at how to make this section of the Internet or the Internet overall resilient? Tell me what you think, Shree.

Shree Shirgurkar:

Thanks, Mark. Let's look at the two main areas that are involved in delivering resilient digital customer-facing services such as websites or SaaS applications. The first one is obviously the code and infrastructure that you own and have full control over. While the second category is the third-party services that you rely upon. For example, the Internet infrastructure includes CDN, DNS, cloud. You also use SaaS applications. You use marketing technologies such as media, ads and certain images and videos, et cetera.

While APM or observability tools focus solely on the first category, they are necessary but not sufficient in detecting issues that impact your customers because of the reliance on these third-party services. And that's the reason you need to map and monitor the Internet Stack.

So, in summary, I think in 2023, what we find is that most digital services rely on a mesh of third-party services provided by other companies which provide key functionalities to the overall customer-facing service. That is what we call the Internet Stack, right? At the same time, these third-party services often cause bottlenecks and lack of transparency for SRE, DevOps and network operations teams.

So, let's get into the details of what we mean by the Internet Stack and the individual components. Let's begin with DNS. Every time you type a URL, a uniform resource locator, into your browser to access a site like Ticketmaster, for example, where buying Taylor Swift concert tickets became very difficult for millions of people, the first step is always the DNS name resolution. Meaning Ticketmaster.com is converted into an IP address that can be accessed. Some of our customers use more than one third-party DNS provider for redundancy, and in the Preventing Outages in 2023 white paper that Mark mentioned, we specifically call out lessons learned from a DNS resolution failure incident. The three key learnings there are: monitor CNAMEs; monitor TCP and not just UDP, the two transport protocols DNS runs over; and monitor for Anycast while monitoring DNS. So please go into the white paper and you can find more details around it.
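To illustrate the "monitor TCP and not just UDP" point, here is a minimal Python sketch of the same DNS question sent over both transports. It is not how Catchpoint's tests are implemented; the function names are hypothetical, and the wire format follows the standard DNS message layout (a 12-byte header plus a question section, with a two-byte length prefix added for TCP). A resolver can answer over UDP while failing over TCP, which is why a probe would check both.

```python
import socket
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (standard wire format)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length; a zero byte ends it.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def frame_for_tcp(message: bytes) -> bytes:
    """DNS over TCP prefixes each message with a two-byte length field."""
    return struct.pack("!H", len(message)) + message


def _recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes from a stream socket."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before full reply arrived")
        buf += chunk
    return buf


def query_udp(server: str, hostname: str, timeout: float = 2.0) -> bytes:
    """Send the query over UDP port 53 and return the raw reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_dns_query(hostname), (server, 53))
        data, _ = sock.recvfrom(4096)
        return data


def query_tcp(server: str, hostname: str, timeout: float = 2.0) -> bytes:
    """Send the same query over TCP port 53 and return the raw reply."""
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(frame_for_tcp(build_dns_query(hostname)))
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)
```

A probe built on this would call `query_udp` and `query_tcp` against each authoritative server and record failures and timings for each transport separately.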

Once the DNS resolution is complete, most SaaS applications and websites use one or more content delivery networks, or what are called CDNs, as shown in this picture. One of our customers, for example, uses 10 different CDNs to deliver their content globally. When monitoring CDNs, there are several factors to be considered: the CDN itself, then there is the origin where the content is available. DNS is involved again, SSL, caching, routing, mapping of the end users to optimal locations. For example, users in Europe should be mapped to the CDN PoPs that are in Europe, taking an example there.

CDNs do not run your code and you have no control over their availability and/or performance. Moreover, CDNs also rely on the origin, which is typically hosted in a cloud provider. Now you've got to worry about not only CDN, DNS, but also the cloud. We can deep dive into each of these topics around CDN monitoring, but that perhaps deserves a separate CDN monitoring session that Mark and I will host at a later point of time.

Coming to routing on the Internet. While CDNs deliver content with high performance, routing on the Internet is controlled by routing protocols such as BGP, the border gateway protocol. With BGP, one needs to be aware of vulnerabilities such as BGP route leaks and hijacks. In our outages white paper, again, referring to the white paper Mark talked about, there are a couple of nice incidents and examples about how BGP misconfigurations can lead to outages, so please refer to that.
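As a rough illustration of what watching for BGP hijacks can mean in practice, the Python sketch below compares observed route announcements against the origin AS you expect for each of your prefixes. Everything here is an assumption for illustration: the function name, the data shapes, and the example prefixes and AS numbers (taken from the ranges reserved for documentation). It only flags origin mismatches and more-specific announcements from an unexpected origin. It does not detect route leaks, which would require inspecting the AS path.

```python
import ipaddress


def find_suspicious_routes(expected, observed):
    """Flag announcements that conflict with the expected origin AS.

    expected: dict mapping prefix string -> expected origin AS number
    observed: list of (prefix string, origin AS number) announcements
    """
    expected_nets = {ipaddress.ip_network(p): asn for p, asn in expected.items()}
    findings = []
    for prefix, origin in observed:
        net = ipaddress.ip_network(prefix)
        for known_net, known_origin in expected_nets.items():
            # Skip IPv4/IPv6 mismatches and announcements from the right origin.
            if net.version != known_net.version or origin == known_origin:
                continue
            if net == known_net:
                findings.append((prefix, origin, "origin mismatch"))
            elif net.subnet_of(known_net):
                findings.append(
                    (prefix, origin, "more-specific from unexpected origin")
                )
    return findings


# Hypothetical data: documentation prefixes and documentation AS numbers.
expected = {"203.0.113.0/24": 64496}
observed = [
    ("203.0.113.0/24", 64496),    # legitimate announcement
    ("203.0.113.0/24", 64511),    # same prefix, different origin
    ("203.0.113.128/25", 64511),  # more-specific prefix, different origin
    ("198.51.100.0/24", 64500),   # unrelated prefix, ignored
]
for finding in find_suspicious_routes(expected, observed):
    print(finding)
```

A more-specific announcement matters because routers prefer the longest matching prefix, so traffic can be diverted even while the legitimate route is still being announced.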

Finishing at the top, you see the media ad and SaaS applications. Marketing technologies, such as media ads, they're most commonly used in many websites and a lot of times these are delivered by third-party services. Another reliance on third-party services where you have little or no control. And finally, finishing at the top is the SaaS, which is Software-as-a-Service, the de facto application delivery model these days. Think of email, collaboration tools, CRM, all coming typically from third-party services.

In summary, what we find ourselves in 2023 is that most digital services rely upon this mesh of third-party services, what we call the Internet Stack. But at the same time, these third-party services often cause bottlenecks and lack of transparency for SRE, DevOps and network operation teams. With that introduction to the Internet Stack, I'm quickly going to show you a demo of some of the aspects we discussed, at the same time how Catchpoint enables our customers to monitor some of these things.

So, to begin with, what I'm showing here is Catchpoint's global nodes network, one of our differentiators. We have over 2,000 nodes across the globe. And the reason for pointing out this global nodes footprint is that you want to test for your customers' experience of your website or applications from a point where the users or customers are present. And that's where we have our nodes present, and it gives you the ability to run proactive tests, thereby detecting any of the issues with some of the components of the Internet Stack.

So, let's begin with the DNS as we did in the explanation of the Internet Stack. What you can see here is we are running DNS tests proactively, and this shows for the last seven days I have an experience score of 33, which is pretty low for this particular DNS service. Not only that, so by running several tests over these seven days, what we do is we bubble up an experience score, so at a glance you can see whether it is green, yellow, or red, and thereby take further actions.

Not only that, we also show key events here. As you can see, connection, attend time for a certain DNS server from certain locations. And then detailed metrics that are analyzed, analytics performed, and at the same time visualized in a fashion where you can take a quick look around. For example, how long did the DNS test take for a particular thing, some network metrics as well.

After DNS, we talked about CDN, so here we are testing an object on the Akamai CDN. And when I talked about CDN monitoring, again, we have a similar consistent approach where we give you an experience score, so at a glance you can see what's happening, what are the errors and key events. Not only that, HTTP component metrics. Remember I was talking about the CDN metrics that you need to monitor, so a lot of visibility into different metrics over a large amount of time.

Last but not least, we announced what is called Internet Weather in a press release just last week, I think. And there we talk about these components of the Internet Stack, the third-party services that you rely upon such as DNS, CDN and others. For example, right now it appears there is an Akamai outage ongoing in the Chicago area. Some impacted. Fastly is another CDN that is having some issues. So these regional and global issues, at a glance, you have the ability to see them, and that way it gives you an ability to understand the bottlenecks and performance issues with the individual components of the Internet Stack. And with that, I will pass it to Mark.

Mark Towler:

Thanks, Shree. That's really effective. In fact, the nice thing with the Internet Weather is it looks like it's a really easy way to say, "Hey, is this my fault or is this fault of something else on the Internet?" And pulling that up, you can see at a glance. I always think that's a really useful tool.

So, speaking of which, we talked about why you need to map the Internet Stack. We talked about what the Internet Stack is. We showed you some of the insights you can gain. What are the advantages here? You're getting full visibility into the stack, right? You're going to be able to catch issues before they impact your business. And as you mentioned, Shree, you're going beyond what either APM or NPM, application performance monitoring or network performance monitoring, can tell you.

Applications tend to be concerned with the top, maybe the second layer as well. There are the applications systems that I'm serving, are they working? And then at the bottom line, you're looking at hardware and transport, and that's the NPM piece. Is my LAN up? Is my WAN up? Is my wireless network working? Everything in between, as I said, is something that's going to be hard to see. And we've got visibility that we just showed you, but without that visibility, it's going to be very, very hard to be proactive. You're going to be reacting to issues. And of course we're talking globally here. This has to be all over the world.

And then there's one other thing that we talk about. It needs to be independent because, really, at the end of the day, Shree, can't we just outsource all of this to the cloud? Wouldn't that be a lot easier?

Shree Shirgurkar:

That's a great point you bring up. While I was at Akamai, I worked closely with some of the cloud providers and these mega-PoP cloud architectures. Their data centers are housed in big cities and major locations, while your customers are anywhere in the globe. And that makes you wonder, when you monitor from the cloud, are you really understanding your customers' experience? And that's totally not the case. And the second piece is, as we called out in the white paper, there have been several outages at these cloud providers, and you can't single out one. Everybody has an outage. And when you are using the cloud to monitor, that definitely becomes a single point of failure, which you don't want to have.

Mark Towler:

Absolutely, there's two issues, two main issues here. One is the cloud is now telling you how the cloud is performing. You're basically letting the fox guard the henhouse. And it's not that they're even necessarily trying to be dishonest, they just may not be surfacing everything. They may not be as transparent and a lot of cases they don't know. And again, I think, Shree, we've experienced this, we've had customers who've used our solution and told the cloud, "Hey, you're really slow here in Chicago." For example. Or, "I'm noticing that your performance has been degrading over the last week in my area." And most of the time the cloud providers are like, "Oh, really? Oh, great, thanks. We didn't know that." Because either they're not monitoring or they're doing this all over the world and they're not paying attention to, as you mentioned, some of the edge cases or where some of your users may be.

So that's number one. And as you mentioned too, the other issue is when the cloud goes down, so does your monitoring of the cloud. All that visibility into the Internet Stack that you wanted, well, it's gone now. You can only see, again, your APM or maybe your NPM as well. Everything in between is a black hole.

We saw that actually about a year and a half ago. AWS went down. They're one of the big online cloud providers, and they took down a ton of global services with them. Amazon, Amazon Prime, Alexa, Venmo, Disney Plus, Instacart, Roku, Kindle, tons of other sites all went down. And again, these people didn't necessarily know that they were reliant on the cloud. If you're sitting here trying to bring up the Little Mermaid for your kids on Disney Plus and it's not working, kids are crying, you're calling your ISP and maybe Xfinity for example. And they don't necessarily have any control either because everything's coming out from the cloud. And if AWS can't serve it, all of a sudden everyone's hurt. So, if the cloud goes down, you don't know what happened. And that's one of the problems. So with all that said ...

Shree Shirgurkar:

Just that's one thing to add, Mark, there is our differentiation. At Catchpoint, as Mark mentioned, even if the clouds are down, any of the clouds are down, Catchpoint monitoring will be up. And that's one of our differentiations even when clouds go down, right? We are-

Mark Towler:

We're absolutely. Absolutely key. And that's the independence I mentioned on the previous slide as well. If you've got an independent network of nodes and you showed that in your demo right there, all over the world we're able to see everywhere. And so even if the cloud, for example, AWS is probably stronger in North America and Europe than it is in some other areas, but if you've got the ability to monitor from somewhere else, even if AWS doesn't realize it's out in, say, Australia or Fiji, you're going to be able to tell.

So, if you're able to do this and implement Internet Performance Monitoring and monitor the entire stack, you're going to get some major benefits out of this. The big one, which should be obvious by now, is you're going to see improved experience. Digital experience for everyone is going to get better, whether that's a user, whether that's a customer, whether that's one of your employees. And you're going to be able to see things coming, so you're going to be able to be proactive instead of reactive and prevent those outages before they start impacting those users.

And the big one is you're going to be able to find and fix problems faster. All right? Your MTTR, mean time to resolution, is going to go way, way, way down when you can see trouble coming or you can pinpoint it. We've had situations where customers have spent literally hours, in some cases months trying to track down a problem, especially if it's intermittent. And as soon as they've got visibility into the entire stack, or as soon as they can start tracking from the actual end user's point all the way back through every single hop they make through that stack, all of a sudden it's a lot easier to see the problem. We've had situations where techs have been literally working a problem on and off for a month and it was solved in less than an hour.

So that's really, really key because it's going to get that MTTR number way down. And of course it's also going to reduce the number of tickets. You're fixing problems, customers aren't complaining. You don't have the angry dad yelling at Xfinity because kids can't watch the Little Mermaid. And really, at the end of the day, as I mentioned, you don't have to rely on the cloud. We can do this separately and you don't have those big blind spots.

This is of course our opinion, but we're going to give you a little credibility here because our customers like us too. Honeywell tells us that they saw a 95% improvement in performance after implementing our IPM solution. SAP tells us that MTTR number I was mentioning went down by 90%. Blue Nile didn't give us any numbers, they just said they were finally having a good night's sleep, which is really the golden, that's the holy grail of IT. And we've got a large customer I can't name, they're an e-commerce and media conglomerate, but they're telling us that they're triaging problems six times faster, which is really, really huge.

So, at this point, what we're going to do is we're going to start taking some Q&A from you. We're about to wrap up a bit. But I did want to remind you this webinar is part of a series and we've got another one that's coming up in August. It's on the 23rd at 11:00 Eastern time. It's going to be about implementing an IPM plan. So, not only how to map your stack, but how do you go ahead and effectively ensure that all of your team and employees are capable of implementing this. You can register at catchpoint.com or feel free to scan that QR code right there. It'll take you right to the registration for that connection. And like I said, feel free at this time to start adding questions into the Q&A or the chat and we'll handle them.

In the meantime, I've got a quick poll. I know not everyone can hang around, but let's see if we've got anyone who'd like to know more about Catchpoint. If you'd like to learn more, simply answer yes to the poll that I just opened up, and one of our team members will be in touch with you very, very soon. Whole point of this is if you're interested, great, you can now move on. If you want to stay for the Q&A, that's welcome as well.

And speaking of which, we're about to go into Q&A. It looks like there's a few that have popped up and we got a few before this session. But in the meantime, if you're going to leave, thank you very much. We appreciate your time, we appreciate you joining us. And I know that you're going to be seeing some form of attendee swag for this one. I don't know exactly what, but keep an eye on your email and you'll get the information you need to redeem that. And with that, if you'll give us just one second, we'll take a look at what's going on in our Q&A and see what we can answer for you.

All right, so Shree, you and I both know the answer to this one, but just got a question here: how many and what percentage of customers are actually relying on third-party services to deliver customer-facing digital services? And I think we both know the answer to that one, right? It's all of them, isn't it?

Shree Shirgurkar:

Indeed. Every time you have a customer-facing service, as we look through the Internet Stack, the first thing that happens is the DNS name resolution, so you rely on a DNS provider. And then if your customers are in any parts of, for example, North America or globally anywhere, then the CDN comes into picture. And so, every single customer who has a customer-facing service has to rely on this Internet Stack as we discussed.

Mark Towler:

And it's not the way it used to be. I don't want to point any fingers, but you and I are probably old enough to remember when the Internet and the network for a company was in one big server room and that was 90% of what you were dealing with. And now everything's been outsourced to the cloud. So, been an interesting transition over the last few years, last 20 years, I'd say.

But really even the last few years we've seen with employees and workers exactly what's been going on with computing services. They've dispersed as well. Everyone has gone home or is far more mobile. And it's creating that challenge where you can't just say, "Okay, the office LAN is working and the office servers are up." It's really the entire Internet is now your new network. So yeah, answer to that one is 100%.

And this one might be one for you, Shree, what are some key tenets for monitoring the Internet Stack?

Shree Shirgurkar:

Yeah, that's a great question. And I think there's three that stand out right away in my mind. The first one is you have to be proactive, because if you identify the problems as soon as they happen or even before they happen, it's the right thing to do because every minute and every second counts. I think one of our customers recently said, "If it is five minutes, we are dead. If there is an outage for five minutes, we are dead in terms of revenue impact and the customer impact." It's real. I think that's the proactive nature.

Being realtime is another important thing. So, some of our products, we make sure that the data is realtime. And the third and as important as well is the ability to monitor from where it matters. The previous webinar we did in this series, we talk about monitor from where it matters. So, the global distribution of your customers and employees mandates that you're able to monitor from any corner of the world. And that's where the global nodes network that we have at Catchpoint and allowing you to run those tests in a proactive way in real-time fashion are some of the key tenets that we're observing our customers are liking.

Mark Towler:

And that brings back what I was just talking about with the diversified and dispersed workforce and customer base. We remember a time when if you monitored most of North America you were fine. And that's simply not true anymore. You do have to monitor from where your customers are because everything can look up to you, but some poor guy in an airport in Dubai can't connect and doesn't know why and he's having a bad experience and that hurts a lot more than most people think.

Another question here, oh this, I'm going to summarize this one: how often should we do this? How often do we test the different parts of the Internet Stack?

Shree Shirgurkar:

I think how often is, again, as I answered the previous question: think about doing it in real time in some cases. For example, take BGP, the border gateway protocol we talked about. If the BGP routes are hijacked, as soon as that happens, your customers are going to start seeing impact and not be able to reach your services. So, monitoring in real time and running some of these tests at the highest possible frequency from a globally distributed nodes network is what you need in terms of making sure your customers are not impacted.

Mark Towler:

There's got to be some sort of happy medium, I'd imagine. Because obviously you don't want to be in a situation where you're monitoring every second. That's just going to cause a lot of unnecessary traffic and load. But depending on the importance of the service, depending on the history, frankly, if you've had problems in the past, and depending on where people are, that's going to determine whether or not you need to be monitoring every minute, every five minutes, every hour, a couple of times a day. What we've seen in a lot of cases is people who monitor more during what they know are historically busy times and they can afford to ramp down and maybe not run a test every five minutes when it's 2:00 in the morning and they know they've only got a handful of users.

So the answer is as much as possible really, and that's going to depend. But ideally, as often as you can, because one thing we didn't really talk about is what Catchpoint does is track a lot of data and keep a lot of data. We don't aggregate data, we give it to you raw and granular and non-aggregated. And this sounds like it's a lot and it is, but not only does our analysis engine do a really good job of handling it, it gives you the details, and the devil is usually in the details.

If you've got an issue that's intermittent and you're only testing every 10 minutes, well, it's going to be a lot harder to track that because if it doesn't happen during that one minute that you're testing, you're not going to know about it. But if you're testing every single minute and capturing every single piece of data from that, that makes a huge difference and it makes it a lot easier to find issues.
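The arithmetic behind that point can be sketched in a few lines of Python. This is a simplified model with stated assumptions, not a description of any product: a single probe location runs one effectively instantaneous test every `interval_seconds`, and the issue is one contiguous window of `issue_seconds` that starts at a random moment relative to the test schedule. Under those assumptions, the chance that at least one test lands inside the window is the window length divided by the test interval, capped at 1. The function name is hypothetical.

```python
def detection_probability(issue_seconds: float, interval_seconds: float) -> float:
    """Chance that a periodic test lands inside a single issue window.

    Assumes one instantaneous test every `interval_seconds` from one location
    and an issue window of `issue_seconds` starting at a random offset.
    """
    if issue_seconds <= 0 or interval_seconds <= 0:
        raise ValueError("durations must be positive")
    return min(1.0, issue_seconds / interval_seconds)


# A one-minute issue, tested every 10 minutes versus every minute.
print(detection_probability(60, 600))
print(detection_probability(60, 60))
```

With a one-minute issue, testing every 10 minutes catches it about one time in ten under this model, while testing every minute always catches it. Running tests from several locations on staggered schedules would raise the odds further, provided the issue is visible from each of them.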

Now, here's one specific to us. And again, Shree, I don't know if you want to answer this or I could? But it says: if IPM isn't reliant upon the cloud, how does it monitor cloud services? I mean, that really comes down to our global observability network, doesn't it?

Shree Shirgurkar:

Exactly. Sorry, I have a dog. Hang on. Sorry if you heard the barking. But in terms of the cloud, I think Mark covered it really nicely. A couple of things. A lot of vendors will monitor from the cloud, and from what we know, the clouds have these mega PoPs, which are only in certain big cities, whereas your customers are spread across the globe in every possible small city or suburb or location. And that's where some of our differentiation is around last mile nodes, for example. So we do have last mile nodes that are the closest approximation of the end users, and from there the ability to run those tests.

And the second piece of it is also the SaaS offering, the software as a service offering, where when you use our monitoring services, it is really available even when the clouds go down because we are not hosted on one of the clouds compared to some of other vendors that you will find. And I think that's the dilemma, right? If you're observing something and if the cloud goes down and if you're not able to observe, then it's not an ideal situation for you and that's something, the second dimension.

Mark Towler:

If all of our users were in the cloud, we could just monitor the cloud, but that's not where the users are. We're actually getting really close on time here. So, if you've got a question that we didn't get a chance to answer or you didn't get a chance to enter in, by all means, feel free to get in touch with us. You can reach us really easily. Just email hello@catchpoint.com. Or you can go to catchpoint.com and there's a couple of different contact forms that you can get into, especially if you'd like to see another demo or like to learn a little bit more about the Catchpoint platform.

In the meantime, we want to be sensitive to your time, so thank you very much again for joining us today and for attending this session. As I said, keep your eye on your email to look for that swag and go and have a great rest of your week. Thanks very much, everybody.

Shree Shirgurkar:

Thank you. Thanks, everyone.