Webinar

Achieving Full Visibility: Modern Monitoring for Distributed Cloud Applications

Today’s applications are hybrid, cloud-centric, service-oriented, API-dependent, and geographically distributed. The monitoring practices we relied on for decades are no longer sufficient. It is critical to monitor all the internet-centric dependencies, connectivity, and cloud application components – and to do so from the user’s perspective so IT operations teams can achieve digital resilience and deliver performance. This session will cover DEM, APM, and IPM and how they can work together to pinpoint issues before they occur, so users receive a great digital experience.

Video Transcript

Stan Gibson

00:05 - 00:59

When users are online, they want an experience that just works. Digital experience monitoring is a vital tool in crafting that outstanding user experience.

Application performance monitoring or APM has long been an important element in digital experience monitoring, but it is now becoming clear that more is needed. Internet performance monitoring or IPM takes a close look at what's happening across the network between the users and the applications.

Working together, IPM and APM provide the complete picture that's needed to assure a great digital experience.

Hi. I'm Stan Gibson for CIO Marketing Services, and I'll be your host and moderator for this video webcast sponsored by CatchPoint. Joining me to explore how IPM works and what it can do for your organization is Gerardo Dada, CMO and field CTO at CatchPoint.

Welcome, Gerardo. Great to have you here for this webcast.

Gerardo Dada

00:59 - 01:02

Thank you, Stan. It's my pleasure being here.

Thanks for your time.

Stan Gibson

01:02 - 01:22

Before Gerardo and I dive into our discussion, just a quick word to you, our audience. You're invited to take a look in the resources area of your webcast player at any time, where you'll find useful information relating to today's topic.

Gerardo, to begin, application performance monitoring or APM is an important piece of the digital experience puzzle. Just what does APM encompass?

Gerardo Dada

01:22 - 02:31

That's a great question. So I think most of our audience will be familiar with APM.

APM has been around for thirty years, started with Wily, one of the first APM products out there. Now we have companies like Dynatrace, Datadog, New Relic, and many others, Honeycomb, Nuance, open source.

It's a very well defined space that is focused mainly, as its name says, on application performance monitoring, meaning it looks at code traces, application infrastructure, and logs and events. Right? So it's really looking at the engine of the software, and it's a technology that's been advancing over thirty years and it's gotten pretty good.

So now most APM tools have gotten to a stage where they can do a really good job of helping developers optimize the code and helping operations teams make sure that code works properly. And, of course, most of these platforms have added more technologies like synthetic monitoring, which is useful for debugging, and other technologies to monitor things like Kubernetes and cloud and networks, etcetera.

But the core of APM is monitoring that engine, the application itself.

Stan Gibson

02:31 - 02:43

Well, we are in the cloud era now, and you did mention cloud. So what are the biggest challenges in monitoring today's cloud native applications, and how do these challenges differ from what you faced with traditional applications?

Gerardo Dada

02:43 - 05:53

Yeah. So, you know, the cloud, it feels like we use it every single day for everything.

Everything we do depends on the Internet. Right? So if you think about, for example, a business that seems to be offline, like a restaurant: you use maybe Yelp to find the right restaurant, then use Google Maps to get there, then you get to the restaurant, then you use maybe a QR code to look at the menu, which is done on your phone.

The server might use a tablet to set up your order and send it to the kitchen electronically. You get the bill printed on a PDA, and then you have to pay, usually with a credit card or with some other electronic form of payment.

So, basically, today, if the Internet is not working, if the cloud doesn't work for those applications, then restaurants are not working anymore. But, you know, we are in an age where we are all remote.

We're having this conversation over the Internet. Most of our work is done through the Internet.

Every application that we use is now not a standalone application like in the old days. I remember a time, you know, maybe twenty, twenty-five years ago, when you would go to work, everybody would be in the same office, and there would be a room where you would have a file server, an Exchange server for email, your print server.

Everything was local. Right? So IT was focused on those servers in the server room and the local area network because everything was local.

Now it's completely the opposite. Every application we use is hosted on the cloud.

It's using APIs and components that are hosted in different clouds. They have disaster recovery technologies that put different pieces of that application in different availability zones.

And every application now is a collection of services that are used in combination. Like, if you look at the traditional ecommerce application today, right, like now with headless commerce, etcetera, a typical ecommerce application will have dozens or hundreds of those services.

If you look at a bank, they have payment processing and security technologies and transactions via Zelle for sending payments to other places, etcetera. So a bank today, or an IT person at a bank, needs to manage hundreds of different disconnected, geographically distributed, hybrid, and complex technologies that are cloud based.

And each one of those things requires the Internet to work. So if you have an API, that API needs to be connected to DNS, to a hosting location.

There are things in the Internet that we typically don't think about, like BGP, or Border Gateway Protocol, that are essential for things to work. And if an IT department is not paying attention to all those things, then those dependencies, which sometimes are out of your control but you're still responsible for, become really your Achilles heel.

Because, again, your applications are managed with APM. They're gonna be okay.

You're in control of those. But your application, your IT, and your business now depend on all these different cloud technologies that are distributed all over the Internet.

That's why companies need to think about Internet performance monitoring.

Stan Gibson

05:53 - 06:03

Let's talk about metrics now. What are the key metrics that IT operations teams should be looking at, and how are they different from the key metrics that have been important in the past?

Gerardo Dada

06:03 - 08:04

That's an interesting question because, traditionally, an IT monitoring team would be looking at the infrastructure itself, right, and they would look at infrastructure metrics. So what can happen today is that IT will say, like, look.

Our servers look fine. Our application looks fine in our dashboards, and yet our customers are complaining.

Right? You would call IT and say, hey, my application doesn't work, and IT will say, like, everything looks green on my side.

And if it's users, then you will have customers or even an API that is not working, and IT is only looking at the engine. It's almost like, you know, we're not getting to the destination we wanna get to, but on your car's dashboard, looking at the engine, the gasoline to air ratio looks fine, the RPMs look fine, your oil level and temperature look fine.

Yeah. That's not really what matters.

You don't get in a car to make sure your oil is good; you get in a car to get to your destination on time. So applying the same philosophy, what really matters is the experience of end users.

Right? So what really matters for a bank is not that the server is running at 80% capacity or that, you know, CPU utilization is x or that latency between the LAN and the database or the wait times in the database are x or y. What really matters is that a user, a consumer trying to make a deposit on a mobile device, can achieve that within, let's say, five seconds.

What really matters is that when you go to a bank to deposit a check or to get money, the person helping you can actually complete the transaction on their computer in a reasonable amount of time. And what matters is that when you, as a business, start a transfer to pay your employees, for example, you want that transfer to happen effectively and not be down.

Right? So measuring the experience of your users is really important, which also means you need to measure experience from where the users are, not where your servers are located.

Stan Gibson

08:04 - 08:20

Well, that certainly makes sense. And you don't wanna be reactive solving things after problems occur.

So proactive monitoring is really crucial here in the cloud. What strategies are you using to move beyond reactive alerts and identify potential issues before they impact users?

Gerardo Dada

08:20 - 11:50

Yeah. That's important in today's world because, you know, if you're using an APM tool and just looking at infrastructure metrics and only looking at your code metrics, let's say you're a large bank and you open your bank offices at 9:00 in the morning Eastern time, right, on the eastern side of the country.

All those banks open at the same time. At 8:50, there are no transactions.

There's nothing to look at in your dashboard because, you know, the application is basically waiting for people to get to work. And maybe there was a problem last night.

Maybe there was a challenge on the application side. Maybe there was a problem in the connectivity in the region.

Like, imagine all the branches you have in Boston that are connected to provider x are having latency issues that are causing lots of problems. So by the time the bank opens, people show up.

Even if you look at NetFlow for network monitoring, there's no traffic. So everything looks great until the time when they start putting stuff on the computer.

Right? The bank tellers start putting stuff in a computer, and things might not work or might be very slow, and, by the way, today slow is the new down. Right? So if something is really slow, it's just as good as down, effectively speaking.

So at that time, the people in the branch get upset. You know, you have a big line.

I think we all, especially if you travel, have heard the "I'm sorry, my computer is slow."

It tends to happen for some reason at bank tellers and rental car counters, at least for me. And, you know, today, companies are very competitive.

So a bad experience might be enough for people to switch. Recent research from Accenture found that, you know, following the example of a bank, one bad experience will get 50% of newer generations to switch banks because they lost trust in the bank to be able to do that.

The same happens when you're shopping online. Right? You're looking for, I don't know.

I needed a hedge trimmer last week to fix my yard. And a site was slow, and I just switched to another site.

There are twenty, fifty different places where I can buy the same product at about the same price. So the ability for customers to switch vendors or companies becomes very, very easy nowadays.

So the frustration has a really big impact on users. That's why being proactive is important.

Right? So the strategy is to use synthetic monitoring to monitor from those locations all the time, before the users actually show up in your store. So as an example, we work with one of the leading banks and we simulate a teller from each one of the major locations.

So not from the cloud, not from some hypothetical location or from two or three places in the United States. We go into each city, inside the bank, and we test and simulate like a robot.

Like, it's really an agent that is simulating: let's log into the system, make a transfer, make some other operations, log out.

Five minutes later, you do it again. So when a problem happens, we detect it immediately and we alert IT, and we can tell them how bad the problem is, which users are being impacted, which locations, etcetera, so you actually have an opportunity to fix it.

And oftentimes, we have the ability to actually tell you where the problem is. In this case, you could say, like, look.

The problem is the connectivity that you have in all your branches in the Boston area that are connected through provider x.
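
Illustrative sketch (not part of the webcast): the scripted teller routine described in this answer (log in, make a transfer, log out, repeat on a schedule, alert when a step fails or the whole transaction is too slow) could look roughly like the Python below. This is a minimal sketch under stated assumptions, not CatchPoint's implementation. The step callables, the location label, and the five-second limit are hypothetical placeholders.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float


def run_transaction(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run each scripted step in order, timing it; stop at the first failure."""
    results: List[StepResult] = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()  # hypothetical step, e.g. log in or submit a transfer
            ok = True
        except Exception:
            ok = False
        results.append(StepResult(name, ok, time.perf_counter() - start))
        if not ok:
            break
    return results


def evaluate(location: str, results: List[StepResult], max_seconds: float = 5.0) -> List[str]:
    """Return alert messages for failed steps and for a too-slow transaction."""
    alerts = [f"{location}: step '{r.name}' failed" for r in results if not r.ok]
    total = sum(r.seconds for r in results)
    if total > max_seconds:  # "slow is the new down"
        alerts.append(f"{location}: transaction took {total:.1f}s (limit {max_seconds:.1f}s)")
    return alerts
```

A scheduler would call `run_transaction` from each branch location every few minutes and forward whatever `evaluate` returns to the alerting system; tagging alerts with the location is what lets a pattern such as "all Boston branches on one provider" stand out.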

Stan Gibson

11:50 - 11:57

Gerardo, many APM tools offer proactive synthetic monitoring. How is IPM different?

Gerardo Dada

11:57 - 16:52

That is very true. Like, a lot of people we talk to, when they first get exposed to IPM and to CatchPoint in particular, say, look, hey.

We all use synthetics. Right? I used to work at Pingdom and SolarWinds, where we used synthetics, and they're cheap.

And synthetics is pretty much a method of collecting data, but synthetics can get very complex. So I'll answer that by saying there are two things that make IPM different.

One is monitoring what matters. Synthetics is basically a simulation of a transaction. Right? And so in our case at CatchPoint, and at any other IPM company, we pay particular attention to the Internet centric technologies, protocols, and types of tests.

So we can go deeper into monitoring, let's say, your CDN or your DNS, your SSL certificates than anybody else. We're the only company on the planet that can actually do real time BGP data.

You might remember, for example, three years ago, Meta had a big incident, and Facebook was down, Instagram was down, WhatsApp was down for everybody all over the world. What happened is BGP, which is basically like the ZIP code system of the Internet, had an issue with the Meta ASN.

So everything at Meta was down, their email, access to the data center. So IT could not even go into the data centers.

And so now companies like Meta rely on CatchPoint to be able to monitor that BGP technology in real time to make sure that nothing happens, that there are no hijacks, for example.

Because, you know, cybercriminals can use BGP to steal traffic. It's happened with cryptocurrencies.

Right? They steal the traffic going to them, and they can steal the cryptocurrency. It's all happened.

It can also be human error as well. So that's the first thing, being able to monitor what matters, being able to go deep into all the technologies in the Internet. We monitor everything from QUIC and HTTP/3, which are some of the new technologies, to MQTT, which is used for IoT devices.

We've done that for fifteen years. We've been working with some of the leading companies in the world to monitor ECN, which is not a well known but an increasingly commonly used protocol to avoid traffic congestion.

And we do a really good job of monitoring some of the foundational technologies of the Internet. So that's the first part, monitoring what matters in the Internet.

The second part is monitoring from where it matters, which is what I mentioned, in that most of these APM tools have a few dozen synthetic locations where they can monitor from. And most often, like, 99% of them are in the cloud, meaning in an AWS, Azure, or Google data center or some type of, I don't know, other hyperscaler data center.

The thing is, if you're trying to understand what the experience of your customers is, your customers are evidently not in an AWS or a hyperscaler data center. They don't have the same connectivity.

They don't have the same bandwidth. They don't have the same constraints and challenges.

So to an extent, it becomes useless in understanding what that real experience of the users is. As an example, most APM vendors don't have any nodes in China, not at all.

I was talking to one of our customers in Mexico. They don't have any nodes in Mexico either.

We have a couple dozen nodes in Mexico. We have over 100 nodes in China.

So we can tell you what the experience is. If you're a luxury vendor, for example, China is usually the largest market for luxury vendors.

If you're a luxury retailer, let's say you're Porsche, China is a really important market for you. We can tell you what the experience is for your dealers, your end users, and your own employees connected through China Mobile versus somebody connecting on a mobile device versus somebody connecting over cable, from different locations in different cities around China.

We could tell you what the impact of the Great Firewall of China is on your applications, what the latency is. We can tell you how that changes over time.

We can tell you which ISPs are more efficient, etcetera, etcetera. Which of your CDNs are giving your users a better experience? A traditional APM system is completely blind to all those things.

They have no context, no geographical context. And today, CatchPoint has right about 3,000 global agents in over 105 different countries.

Plus our customers can deploy their own agents in their own locations. Right? So the bank I was talking about has deployed their own agents inside their branches.

We have manufacturing companies that deploy them in their manufacturing facilities, etcetera. So monitoring what matters and monitoring from where it matters is really important when you care about user experience and being proactive, and that's how IPM is different.
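
Illustrative sketch (not part of the webcast): the BGP hijack monitoring mentioned in this answer rests on a simple idea, which is to compare the origin AS observed for a route against the origin you expect for your own address space, including more specific prefixes carved out of it. The Python below is a minimal sketch of that comparison only. It does not use any CatchPoint API; the prefixes and AS numbers are documentation and reserved values, and how the observed routes are collected is left out.

```python
import ipaddress
from typing import Dict, List, Tuple


def find_suspect_routes(
    expected: Dict[str, int],          # your prefix -> the AS that should originate it
    observed: List[Tuple[str, int]],   # (announced prefix, origin AS) seen in routing data
) -> List[str]:
    """Flag announcements inside an expected prefix that come from another origin AS."""
    findings: List[str] = []
    for prefix_str, origin in observed:
        seen = ipaddress.ip_network(prefix_str)
        for exp_str, exp_origin in expected.items():
            exp = ipaddress.ip_network(exp_str)
            if seen.version != exp.version:
                continue  # subnet_of() only compares networks of the same IP version
            # subnet_of() is also true for an exact match, so this covers both
            # same-prefix and more-specific announcements.
            if seen.subnet_of(exp) and origin != exp_origin:
                findings.append(
                    f"{prefix_str} originated by AS{origin}, expected AS{exp_origin}"
                )
    return findings
```

An unexpected origin is a signal to investigate rather than proof of a hijack, since, as noted above, the cause can also be human error.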

Stan Gibson

16:52 - 17:08

However, today's distributed systems are certainly quite complex. How are you currently addressing the need for end to end visibility across all the dependencies in your application stack, including, let's say, third party services and APIs?

Gerardo Dada

17:08 - 22:16

Yeah. It's an interesting term you mentioned there, end to end visibility, because I think a lot of people in technology, especially marketers, like using the terms end to end and full stack and saying we do everything. Right? But it's all based on what your context is.

A lot of the time when people say end to end, they mean within the application stack. So when we say end to end, we measure everything from the end user.

In my case, I have a CatchPoint agent on my laptop. It's measuring the basic health of my laptop, the connectivity through Wi-Fi to my router, router to the Internet, through the backbone of the Internet, through my ISP, all the way to the applications I wanna use.

And in the case of custom applications, we actually have tracing capability to do synthetic tracing into the code and even into the database, right, using OpenTelemetry. At the same time, our service might be using security services from a third party, or this application might be hosted in the cloud.

We're monitoring all those third party dependencies. And what's really cool is that now we are showing all those things in what we call the Internet Stack Map.

So for many years, IT has been used to looking at dependency maps. And those dependency maps are typically locally based.

Right? Those are only about your application. They show you, in the simplest way, your web server, your application server cluster, your databases. Sometimes they show your network devices.

Sometimes they show services within your application. But now with Stack Map, we can show everything from your dependencies to DNS, to your cloud providers, to your, you know, third party services that you might use, etcetera.

And it also relies on a service that we have called Sonar, which is continuously monitoring hundreds of services around the Internet. And we use AI to make sure that we detect when it's really just a failure of a test versus when a service is really down.

So for example, a couple of months back, I started getting messages from my own team saying our website is down. Right? And effectively, catchpoint.com was down.

What you typically do is you start looking at your application; if you're a big retailer, you call a war room, pagers start going off.

And in my case, I opened this application and saw, like, oh, look. Amazon East is down.

The hosting product used for our website is in Amazon East. So there's no need to open tickets or to do anything, just to wait for Amazon to fix it.

Right? So at least I knew what the problem was immediately, and I knew that I didn't have to take any action. If you were a bigger company, you would probably have multiple availability zones and something like that.

Right? The hosting company we use by Flowton is one of the largest ones. It could have done a little bit better to manage that dependency, but at least I knew what the problem was.

A similar example: about a year, I think it was a year and a half ago, a lot of the retailers out there were using a technology called Adobe Tag Manager. Right? And it's not something they control or they monitor.

So at one moment, the Adobe Tag Manager went down, stopped working, and became an incident for all those retailers and other websites that were relying on the Adobe Tag Manager, making those sites slow or unusable. So our customers, and we have a lot of customers in retail, some of the largest ecommerce platforms and the largest retailers use CatchPoint.

They get an alert saying, look, Adobe Tag Manager went down, turn it off. It's a dependency.

You cannot fix it. Just turn it off in your website.

It's better to run your site without the tag manager than not having a website. They turned it off, business as usual, customers didn't notice, management didn't notice, their bank account didn't notice.

Other sites started having problems, and they didn't have this technology to continuously monitor the third party dependencies. So they started creating a major incident.

They called a war room. They said, hey.

Let's check the database. Let's check the network.

Let's check our providers. And it's almost like an ER where you go one by one, triaging, trying to find what the root cause of the problem is.

Let's go check Adobe Tag Manager; but if you go to the status page, the status page was not updated for, like, forty minutes. Downdetector took about an hour.

Right? Because Downdetector relies on users to report problems. So these companies were having significant issues because they could not pinpoint the problem.

And many of our customers that were monitoring the tag manager using our Sonar service were using AI to tell them, look. Your site has an issue.

Here's the incident. Fix it.

Problem solved within a minute. Right? That is the difference of being able to be proactive and being able to have that end to end visibility for everything that really impacts your application.

And it covers everything from hosting providers to DNS to APIs to anything that you wanna monitor, because our monitoring can monitor basically anything through scripts.
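
Illustrative sketch (not part of the webcast): the triage shortcut described in this answer, going straight to the failing third party dependency instead of checking components one by one, can be expressed over a dependency map. The Python below is a minimal sketch that assumes you already have a map of what depends on what and a set of components whose checks are currently failing. It is not how Stack Map or Sonar work internally, and the component names are hypothetical.

```python
from typing import Dict, List, Set


def root_cause_candidates(deps: Dict[str, List[str]], failing: Set[str]) -> List[str]:
    """Return failing components none of whose own dependencies are failing.

    A component that is failing while everything it depends on is healthy is
    the most likely place where the problem starts; components above it in
    the map are probably failing as a consequence.
    """
    return sorted(
        node
        for node in failing
        if not any(dep in failing for dep in deps.get(node, []))
    )
```

With a map such as `{"website": ["hosting", "dns", "tag-manager"], "hosting": ["cloud-region"]}` and failing checks on `website`, `hosting`, and `cloud-region`, the function returns `["cloud-region"]`, which mirrors the "the cloud region is down, so just wait" conclusion in the anecdote above.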

Stan Gibson

22:16 - 22:27

Gerardo, many organizations are adopting observability practices. How do you define observability in the context of your cloud applications, and what key metrics and signals are you prioritizing?

Gerardo Dada

22:27 - 24:12

Yeah. So I think in the context of observability, it's almost become a bad word, because it was meant to be kind of an evolution of monitoring, and people just think about it as a fancy word for monitoring.

Second, the definition was very APM centric. But if you think about observability as really being able to understand the entire system and its implications and what's causing problems, we see a very important trend in some of the largest customers.

At CatchPoint, we have only, you know, a couple hundred customers, but we serve some of the largest companies in the world. And something we see more and more often is, one, companies are consolidating into a central monitoring team that either does the monitoring or establishes governance and best practices for all the IT teams to do the monitoring.

The second is consolidation. People wanna save money.

APM and observability can get really expensive, and they're trying to consolidate into a single tool. They're trying to make sure they're adopting new technologies and new tools like OpenTelemetry.

And having a smaller number of tools makes it easier to manage and to train and to negotiate contracts. And the third one is we see more and more people using APM with IPM working together to create really observability magic.

So we have one of the largest companies in Germany that is using Dynatrace and CatchPoint in their system. And their team is actually called AIPN squared, and they went from something like 300 incidents a month, I believe, to zero.

And they're achieving this by consolidating and having this now truly end to end visibility, which is a combination of APM plus IPM.

Stan Gibson

24:12 - 24:26

Gerardo, let me ask you this. Can companies expect shorter Mean Time to Identify or MTTI and Mean Time to Resolve or MTTR? And how about achieving better XLOs or experience level objectives?

Gerardo Dada

24:26 - 28:12

Yeah. So absolutely.

Like the example I just mentioned for this German company that runs the websites and ecommerce sites for thousands of vendors, we see this all the time. Like, most of our customers usually tell us CatchPoint was the first thing to let them know that there was an incident.

Right now, we're working with one of the largest social media companies who found a problem with CatchPoint, and they said, like, how come none of our other systems detected this? Right? So that ability to first identify that there is a problem; then there's what people call MTTI, also mean time to innocence. Right? It's not my problem.

It's not the network or it's not the application. Basically, it means identify the root issue, and then mean time to resolve.

Once you find not only that there is a problem, but what is likely the source of the problem, we also use AI to do root cause identification as we see this end to end capability for an application. We have a very good technology to help you identify what's likely the root cause so you can go fix it and reduce mean time to resolve.

And there's a new one that we call MTTV, which is mean time to validate, because we are doing this ongoing proactive checking. Once you've implemented the fix, we can within minutes tell you, yes.

This actually has improved the end user experience or the customer experience or the API experience. And so all those things are very commonly shorter when you use IPM.

That's the standard with all of our customers. That's how we prove our value.

That's our mission in life. That's what Mehdi, our CEO, would say.

Reducing MTTI is the mission, why we created CatchPoint. And then you mentioned XLOs, which is interesting because in the industry, there are a lot of SLAs, service level agreements, which is what you expect a vendor to give you or what you expect from another team inside the company.

Right? Like, I want four nines or something like that. There's the SLO, which is a service level objective.

But more and more companies are starting to adopt XLOs. An XLO is an experience level objective.

And what's unique about it goes back to the conversation at the beginning, which is: I wanna make sure, if I'm an online retailer, that my customers in these 10, which are my top 10 geographic locations, I don't know, could be China, New York, Paris, etcetera, can complete the transaction within five seconds. I wanna make sure that my bank tellers can do this within this time in 95% of instances, or I wanna make sure that my payment API can complete a payment transaction within five hundred milliseconds.

If your bank is doing trading, it might be milliseconds for completing a trade. If you're in health care, I wanna make sure that the doctors serving our patients in life and death situations can operate, can look at their medical files, and can work 99.999% of the time from each one of our hospitals.

Right? So those experience level objectives set the bar for what IT is really meant to do, which is building the systems that power the business, that allow us to do things.

But they also allow you to align IT with the business. So then the investments you're making in IT are having a direct impact on the experience of your users or your employees, your customers, your technology, which has a direct impact on your business.

Right? So personally, we think XLOs are a great tool for unifying IT and the business, to prove the value of investments, and more importantly to ensure that a company is paying attention to what matters, which is end user experience.
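
Illustrative sketch (not part of the webcast): an experience level objective of the form "95% of transactions complete within five seconds, per location" reduces to a simple calculation over measured transaction times. The Python below is a minimal sketch of that arithmetic. The location names, the threshold, and the target are assumptions chosen to match the examples in this answer, and where the measurements come from is left out.

```python
from typing import Dict, List


def xlo_attainment(durations_s: List[float], threshold_s: float) -> float:
    """Fraction of measured transactions that finished within the threshold."""
    if not durations_s:
        raise ValueError("no measurements for this location")
    within = sum(1 for d in durations_s if d <= threshold_s)
    return within / len(durations_s)


def xlo_report(
    samples: Dict[str, List[float]],  # location -> transaction durations in seconds
    threshold_s: float,               # e.g. 5.0 for "within five seconds"
    target: float,                    # e.g. 0.95 for "in 95% of instances"
) -> Dict[str, bool]:
    """Map each location to True if it meets the objective, False otherwise."""
    return {
        location: xlo_attainment(durations, threshold_s) >= target
        for location, durations in samples.items()
    }
```

Evaluating the objective per location rather than as one global average is the point of the design: a global figure can look healthy while one city or one region is consistently missing the bar.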

Stan Gibson

28:12 - 28:17

How can customers get started with CatchPoint IPM? Can they add it to what they have now?

Gerardo Dada

28:17 - 30:06

Yeah. Like I said, IPM is not exclusive. It doesn't replace anything, because we're usually looking at the Internet.

It's very easy oftentimes to start monitoring things without having to make changes, add an SDK, or implement things. Sometimes you need to deploy agents inside your network.

We do that, like I said, inside banks, inside governments. We have them, I mean, in the German government.

Right? So we pass the strict privacy and security regulations. So it's easy to do that.

We recognize that in this industry of observability, a lot of companies basically say, like, start using the product and we'll send you an invoice at the end of the month or at the end of the year. And we've heard stories of surprises that are six or seven, sometimes even eight digit surprises.

We don't wanna do that. So we created a package that we call App Assure, which is basically, for about $38,000.

We'll give you all of our technology to monitor one key application, and that includes synthetic tests and API tests and DNS tests and a bunch of different technologies, including this dependency map. It's not gonna be complete visibility across all the regions and all the tests you're gonna do, but it's super easy to get started.

So if you're a buyer, instead of saying I need to make a million dollar investment, and I don't know if I'm gonna get overages on top of that, and I hope this investment pays off, your risk is limited to 30,000. Right? So instead of spending six months designing the right solution, we want people to get started, start seeing value from our product.

And based on the value you see, you start adding more tests and more regions or more use cases, more applications. And that way, you have more confidence in the solution.

You're seeing value immediately from the solution, and you can grow with confidence that there are not gonna be any financial surprises.

Stan Gibson

30:06 - 30:11

Gerardo, we're almost at the end of our time. What are the key points that members of our audience should remember?

Gerardo Dada

30:11 - 31:21

Well, the key point is, I would say, the world has changed, and everything we do in our professional and personal lives depends on the Internet. The Internet is now complex, distributed, hybrid, service oriented, so it's important for companies to look at IPM technologies.

We're not the only one. We think we're the best.

But there are other technologies out there. So I really encourage any person in IT to seriously take a look at Internet performance monitoring.

And second, pay attention to your user experience. Set that as your goal, as your ultimate objective, implement XLOs, and if you do those things, we've seen that IT gets better and gets better alignment to the business.

And, ultimately, what we want is for companies to achieve what we call Internet resilience, meaning you can be resilient against any risk, any incident, any problem out there. Of course, you're not gonna be 100% immune.

Nobody can promise that. But at least you will be able to catch incidents faster, respond faster, find the root cause faster, and deliver better performance and better resilience to all your digital systems.

Stan Gibson

31:21 - 31:47

That's great information, Gerardo. Thank you.

Well, that is all the time we have in this video webcast. Many thanks to Gerardo Dada, field CTO at CatchPoint, for explaining the pivotal role of IPM in delivering a great digital experience by closely monitoring Internet traffic globally.

For more information on this topic, please do see the resources area of your webcast player, and thanks for joining us. For CatchPoint and CIO, I'm Stan Gibson.

Gerardo Dada

31:47 - 31:49

Thank you, Stan. Thanks, everybody.