Webinar

Resilience through Chaos: Integrating IPM into your Observable SDLC (OSDLC)

The idea of deliberately introducing failures into your systems to test for weaknesses—what’s known as chaos engineering—can be intimidating. But what if you could evaluate resilience without actually breaking anything?

Chaos engineering shouldn’t be about throwing your systems into disarray, but about implementing intentional, manageable disruptions or degradations to ensure resilience. The key to success lies in integrating Internet Performance Monitoring (IPM) into your CI/CD pipeline. This method allows you to pinpoint bottlenecks, continuously test and improve your systems, and enhance reliability as early as the development stage. We’ll show you how to use Internet Performance Monitoring as a powerful tool to strengthen your systems against unexpected failures or degradations to ensure seamless user experiences.

Key takeaways:

  • Learn how to build resilient internet-facing systems through an IPM-informed approach to CI/CD
  • Explore how chaos engineering can drive reliability in the face of disruptions
  • Discover how techniques such as incorporating experience level objectives (XLOs) and proactive testing can make your life better
Register Now
Video Transcript

Jared H

00:02 - 01:33

Good morning, good afternoon, or good evening, depending on where you're tuning in from today. We wanna thank you for joining this TechStrong learning experience here with Catchpoint.

Couple of quick housekeeping notes. This session is gonna be recorded.

It will be available afterward on demand, and we will send you a link to that recording once we conclude. We do have some resources, I'm sorry, some handouts in the resource section on the left side of your screen.

Go ahead and click those, download those, use those during the presentation. We want you to follow along with those as well.

We've also got some polls that we're going to have throughout the presentation, so make sure you're sticking around and paying attention for when those pop up on your screen. Please go ahead and send any of your questions to the Q&A section, which is on the right side of your screen.

Make sure you get those in, so that we can answer those at the end of the program. Any chats that you wanna do, tell us where you're tuning in from.

Share your experiences, your comments, your understanding on the matter. We do wanna engage with you, so make sure you're using the chat throughout the session.

Just, you know, first off, just tell us where you're tuning in from today and what brings you here, why you're joining us. Let's go ahead and kick off today's topic.

We've got Resilience through Chaos: Integrating IPM into your Observable SDLC. And today, we're joined by Sergey Katsev, VP of engineering at Catchpoint, and Leo Vasiliou, former DevOps practitioner and author of the SRE Report at Catchpoint.

Sergey, Leo, thank you for joining us. If you guys wanna go ahead and take it away from here, I'm gonna turn my camera and microphone off.

Sergey Katsev

01:33 - 01:36

Thank you so much.

Leo Vasiliou

01:36 - 02:38

Sounds good. Thank you for the introduction, Jared.

Always a pleasure to be with you and the team at TechStrong. Serge, I guess, just logistics: is my screen share coming through?

Okay. Can you see me through the slides? Perfect.

So let's go ahead and advance it and jump in. As Jared said, Leo Vasiliou.

And I would consider this to be a very targeted, focused, crisp, efficient webinar. Really, all we're gonna do is break down some key concepts to make sure we're talking about the same thing, we're all on the same page.

We are very excited to do a live demonstration to see those concepts in action, then a few words on what comes next, because it's not a one and done. Serge, did I miss anything? Do you wanna say anything about yourself before we continue? No.

Sergey Katsev

02:38 - 02:48

I'll just add that, yes, excited. But, of course, with live demos, right, there's all sorts of trepidations, but, absolutely excited.

Leo Vasiliou

02:48 - 03:37

Agreed. Agreed.

So, intro and concepts. Like I said, just to make sure we're talking about the same thing. Before we jump in, let's take a moment to level set on the concepts.

Serge, why have this talk at all? Well, in my opinion, I think it's because the products and services that our companies and organizations sell generate the revenue that pays our rents and mortgages. So we need to ensure reliable CI/CD so we can remain competitive and pay said rents and mortgages.

But, unfortunately, that messaging didn't get approved during the review cycle. So, Serge, maybe you can tell us why we're here, why we're having this talk at all.

Sergey Katsev

03:37 - 04:07

So, I think we're going to repeat it a couple of times today, but complexity, complexity, complexity, dependencies, dependencies. Right? The Internet's a very complex place, and you don't own most of the services that you rely on.

So it's really important to understand what's going on so that the ones you do own work the right way.

Leo Vasiliou

04:07 - 05:23

Yeah. And so, you know, these are some of the things that we hear.

So we have the fortune of being able to talk to our customers. Right? So take any one of these, because there's really one main thing here aside from the words on the screen, and that is IT and the business must be partners.

Right? They must align. So take any of these. Flakable, that's the combination of flaky and unreliable.

A flaky, unreliable automation suite? IT is saying it's gonna increase our manual work, increase risk, increase tech debt. Whereas the business is saying, actually, that's gonna be a higher chance of releasing bugs to production, leading to customer dissatisfaction.

But regardless of what we have here, so a lack of visibility, can't connect, can't even do our work, siloed tooling and telemetry, that's the "I'm not seeing this issue on my end" conversation, or "it worked fine in the cloud," but the business is saying it's not working fine for the users.

The main takeaway here is that IT and the business must be partners, must align on the goals and on what is important. And, of course...

Sergey Katsev

05:23 - 05:45

the flip side is also true. Right? The business side may say, hey.

If you do this, there's a higher chance of customer dissatisfaction, or they might not connect the dots, and then it's up to the IT practitioner to remind the business what it is that they care about. Right? Hey.

We need the budget for these things because otherwise dot dot dot.

Leo Vasiliou

05:45 - 08:27

Nicely said. Nicely said.

Alright. So let's, continue on here and get to, some of those concepts like we were discussing.

So resilience through chaos, integrating Internet performance monitoring, IPM, into your observable software development life cycle, OSDLC. So let's break this down to make sure we're all thinking of these terms in the same way.

It starts with the Internet stack, which is the collection of technologies, systems, and services that make possible and impact every digital user experience. And it's an entire chain of components from core protocols like DNS and BGP to platform providers like your APN providers, your CDN providers, across points of interconnection, and all the way out to the last mile, home, and residential ISPs.

It is separate and distinct from your application stack, and the Internet stack is different for different users in different parts of the world. And I like to think of it as if your application stack is what generates the code, the Internet stack is what gets the code to the users where they are in the world.

Having said that, what is Internet Performance Monitoring? For us, it is about visibility into that stack to catch issues across your customer base, your workforce, your applications, your networks, your websites, before those issues impact your business. And in order to do that, we have to preserve these four critical properties of the Internet stack despite adverse conditions, which is the truest, shortest definition of resilience.

So, resilience through chaos: integrating Internet Performance Monitoring, or IPM, into your OSDLC. And Serge, I think people probably already have an idea about, like, uptime, availability, performance.

Is it fast or slow? But real quick, just to kinda talk about reachability, if you wanna help me out here. But I think of it as what good is a brightly burning sun if its rays cannot reach you on the beach on a cloudy, stormy day? That's my go to analogy.

What good is a highly performing, highly available cluster of services in the cloud if those products and services cannot reach users wherever they are, whenever they are. Right?

Sergey Katsev

08:27 - 09:00

Yeah. Absolutely.

And, again, the flip side. Right? So, yes, can the service reach the users, but probably more importantly, can the users reach the service? And so as a developer, as a product manager, product owner, our goal is to think about where all of these various people are using our product from and make sure that we think about the reachability of those customers, those users, to the applications we're delivering.

Leo Vasiliou

09:00 - 09:07

And, Serge, since you've got the microphone, maybe you could kind of talk us through, the other, key concept.

Sergey Katsev

09:07 - 11:08

Yeah. Absolutely.

So I'll talk about CI/CD in a second. The other thing I just wanted to say is, the IPM, let's call them KPIs, are going to be different for different people.

So just like in CI/CD, there are different stakeholders that are involved in different parts of the software development life cycle. Right? And so we have this as the CI/CD sort of continuum.

It could be the SDLC or OSDLC continuum. It could be the software development pipeline, different names, agile, different names for the same thing.

But if you would skip to the next one, please, Leo, the goal is to make sure that you and all of the various stakeholders are thinking about all of these things in particular. Is the application working properly along the whole way? And we've talked about this in previous sessions, but from the very planning phase, where you're kind of defining your IPM strategy: what is it that I care about? Right? Do I care about just people in New York, where I'm located? Do I care about people in, let's say, Tokyo? Do I care about speed, or do I only care about availability? Right? All of these different things, turn them into business KPIs, or rather take the business KPIs and convert them to the observability, the IPM strategy.

And then that flows through the rest of the CI/CD process here. Right? Applying IPM as code, well, that's where integrations and APIs come in, and I'll show you that in a second in the demo.
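
For illustration, a CI gate along the lines Sergey describes might look something like this minimal sketch. The endpoint, payload, and thresholds below are assumptions made up for the example, not Catchpoint's actual API; the point is simply that the pipeline asks a monitoring service to measure the staging build and fails the job when the agreed KPIs are missed.

```typescript
// ipm-gate.ts -- hypothetical CI step: ask a monitoring service to run a
// synthetic test against the staging URL, then fail the build if the
// returned KPIs miss the budgets the business agreed on.
// NOTE: the endpoint and response shape are invented for this sketch.

interface SyntheticResult {
  location: string;        // e.g. "New York", "Tokyo"
  availability: number;    // 0..1
  responseTimeMs: number;  // total response time
}

const BUDGET = { minAvailability: 0.999, maxResponseTimeMs: 1500 };

async function main(): Promise<void> {
  const stagingUrl = process.env.STAGING_URL ?? 'https://staging.example.com';

  // Hypothetical API call; substitute your monitoring vendor's real API here.
  const res = await fetch('https://monitoring.example.com/api/run-synthetic', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ url: stagingUrl, locations: ['New York', 'San Francisco', 'Tokyo'] }),
  });
  const results: SyntheticResult[] = await res.json();

  const failures = results.filter(
    r => r.availability < BUDGET.minAvailability || r.responseTimeMs > BUDGET.maxResponseTimeMs,
  );

  for (const f of failures) {
    console.error(`KPI miss from ${f.location}: availability=${f.availability}, responseTime=${f.responseTimeMs}ms`);
  }
  process.exit(failures.length > 0 ? 1 : 0); // non-zero exit fails the pipeline
}

main().catch(err => { console.error(err); process.exit(1); });
```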

Dependency analysis, super duper important, has to be done. And then, of course, the feedback loop.

Right? How are you making sure that everyone along the way is using the same KPIs, is monitoring the same way, and that you're improving the system as it changes? And...

Leo Vasiliou

11:08 - 12:51

what I'll add here is we have conversations about, you know, shifting left or shifting right, and the idea of this webinar is really, I guess, about doing both, shifting wide, as we have coined the term, and that is to kind of integrate IPM through the entire life cycle. I'm very excited for the demo, and so we'll just kind of talk through this one last slide here before we hop right in.

So resilience through chaos. Right? So we've talked about Internet performance monitoring.

We've talked about CI/CD. Resilience through chaos, embedding it into your entire CI/CD.

What is resilience through chaos? So, we are talking about the idea of chaos engineering to ensure resilience. The caveat there is that, in addition to just injecting complete failures, injecting performance degradation is nearly as powerful and effective for these chaos experiments, without actually bringing the system down.

Yeah. So kind of take that as a critical way to think about it, because in my opinion, calling something chaos engineering on paper probably does not sound like the kind of thing a layperson in the business who holds the budget would be like, yeah.

Let's do some chaos.

Sergey Katsev

12:51 - 13:21

Yeah. Exactly.

The quote, unquote real way, or the original way, to do chaos engineering is to walk into a data center and pull a wire and see what happens. Right? Nowadays, probably not too many people doing that.

But what we're going to walk through with the demo is that you could actually do exactly that without impacting the application that's really running in production, but still see what those, you know, chaos results would be.

Leo Vasiliou

13:21 - 14:28

The downstream effects. Right? So all we're trying to say here on this slide is, like, you've got all your things on the left, all your inputs, all the components of your application, what it takes to get those experiences to your end users, and will the output be good, that is the top, or will it be bad, that is the bottom? Yeah.

That's what we aim to answer with chaos experiments in the broader, resilience testing practices. And excuse me.

So, you know, what's that expression? I don't fail. I learn what doesn't work and then try again. Or, I don't fail.

I only win or learn, or whatever it is. Right? So what we're trying to say is that each of these experiments kind of adds to your shield.

Like I said, that kind of what doesn't kill you makes you stronger idea, I believe. And there's only one piece missing.

Right? That is you, the people who are giving us your precious time today, listening to this webinar and watching this recording.

Sergey Katsev

14:28 - 14:31

Yeah. So let's jump right in.

Leo Vasiliou

14:31 - 14:53

Let's jump right in. So we'll go ahead and I'll stop sharing these slides.

We'll have Serge share his screen and see these concepts in action. So, Serge, it looks pretty good.

I see demo app. It's coming through fine.

Sergey Katsev

14:53 - 18:17

Yep. Alright.

So what are we going to talk about first? So we're going to do kind of the two halves of this demo the same way that Leo just talked through the introduction slides. And that is, first, we're going to create an application, and we're going to set up an observable SDLC around that application.

Everything's gonna be working wonderfully, and then we're going to break it by applying chaos engineering to see what actually, happens when we start playing around. So the application itself, we have this super simple application as you can see.

It loads an image. It says demo app, and there's a button that says call back end.

And when I click that button, it pops up a message. So there actually is a back end application to this front end, which returns this message.

And the last change that was committed against this application was, several days ago on June 19. So great.

Super simple application. So the first thing to do when designing an observable SDLC is let's think about what the dependencies are.

And this is kind of interesting, just my own personal experience setting up this demo. I started doing the dependencies on paper, and then I said, okay.

That's silly. Let me do it using Visio.

And then it got too complicated, and so I said, wait. This is entirely too silly.

Let's just use Catchpoint's stack map. And so Catchpoint has this great diagramming, visualization feature called the stack map, and that is not what this webinar is about, so I won't dwell on it.

But here is what I created. And so you see here that the application itself, well, you don't see this here, but the application itself is a React front end with a Node.js back end. It's running in AWS.

It has dependencies on DNS, and that big image with the dots is actually loaded from the Catchpoint corporate website, which runs on Webflow, and so that's a dependency. So great.

I went and drew all of that out. Right? And I like to think of these things as what is the data flow for particular users.

And so as a customer, I would go to this demo application URL, which then goes to Cloudflare because that's the CDN, which then sends me to a front end load balancer. The front end load balancer, by the way, uses Cloudflare DNS for the domain name lookup.

The load balancer sends me to the front end application itself. All of these four are docker containers.

The front end application loads or interacts with the back end application through a load balancer, and it has this dependency on Webflow, which we already talked about. Right.

So such a simple application, again, literally, it's a screen with one image and a button.

Leo Vasiliou

18:17 - 18:18

Yeah. Serge, and...

Sergey Katsev

18:18 - 18:19

this is what you see.

Leo Vasiliou

18:19 - 19:15

Yeah. I was just gonna say, I don't know if that's where you're going with it, but what I saw is that, if you go back to that app, you don't have to, but, you know, an image with a button to click may seem fake.

Right? But the architecture, the modern architecture that you were talking through with that service map of your front end components, that is very real. Yeah.

So we absolutely ask you to not be fooled by the simplicity of the application, but that app, the underlying architecture and set of stack components we talked about, is in fact what an architecture might look like, multiplied by the complexity and scale of whatever your actual applications, products, and services are. Absolutely.

And I mentioned earlier that I like to think of...

Sergey Katsev

19:15 - 27:13

these as from the point of view of the users. Well, guess what? One of the other users or stakeholders of this system is the developer.

Right? And so the developer interacts with the system using GitHub Actions in this case for the demo. And so GitHub Actions actually is what launches or updates all of the containers, and it needs to interact with AWS because that's, of course, where these particular containers are running.

And so there's dependencies on all of that. So what happens, if any of these dependencies go down? Right? So this is kind of the next phase of the analysis.

Well, first of all, some of them are production system dependencies. Some are tooling dependencies.

Right? And so if Cloudflare, for example, has a hiccup, your users are not getting to your application. Maybe that's okay with you.

Maybe it's not. You still want to know about it so that you can decide what to do if a problem does arise.

If DNS has a problem, same thing. Webflow.

Okay? We can test that with chaos engineering. All of these things, as you go through the dependencies, you need to say, okay.

Do I care if this thing is slow, unreachable, not available? Right? So go through all of those pillars of resilience and decide if they're important to you. And that's how you come up with your main KPIs.
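
One way to make that dependency walk concrete is to write the answers down as data, so they can later drive which monitors get created. A minimal sketch; the dependency names come from the demo's stack map, while the record shape and the fourth property name are assumptions for illustration.

```typescript
// A lightweight way to capture "do I care if this is slow, unreachable,
// or unavailable?" for each dependency in the stack map.
type ResilienceProperty = 'reachability' | 'availability' | 'performance' | 'reliability';

interface DependencyKpi {
  dependency: string;              // box in the stack map
  kind: 'production' | 'tooling';  // production-path vs. pipeline tooling
  monitorFor: ResilienceProperty[];
  notes?: string;
}

const dependencyKpis: DependencyKpi[] = [
  { dependency: 'Cloudflare CDN', kind: 'production', monitorFor: ['reachability', 'availability', 'performance'] },
  { dependency: 'Cloudflare DNS', kind: 'production', monitorFor: ['availability', 'performance'] },
  { dependency: 'Webflow (hero image)', kind: 'production', monitorFor: ['availability', 'performance'],
    notes: 'Page render currently depends on this image loading.' },
  { dependency: 'GitHub Actions', kind: 'tooling', monitorFor: ['availability'] },
  { dependency: 'AWS (containers)', kind: 'production', monitorFor: ['availability', 'performance'] },
];

// Each entry becomes one or more monitors later in the pipeline.
console.table(dependencyKpis.map(d => ({ ...d, monitorFor: d.monitorFor.join(', ') })));
```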

Well, let's move on to the observable SDLC piece. Indeed.

One more comment, actually, before we do that. The dependencies here: usually, you would have your application, you're applying monitoring to it, and the stack map will actually pull all of the dependencies and draw as many boxes as it can.

In this case, I used it the other way around, sort of like the TDD, test-driven development, methodology, where you write the tests first and then you write the application. I drew the stack map and I said, okay.

This is what I have in my mind. Let me go and develop and create a monitoring strategy for this application.

So that's what we're looking at. Makes sense.

Alright. So the pipeline.

Relatively simple, but pretty powerful pipeline. So what it does when somebody makes a change on the develop branch, I have two branches, develop and main, for this demo, is it runs unit tests, then it deploys to the staging environment, then it runs WebPageTest.

So WebPageTest is one of the pieces of Catchpoint geared for web page performance. So that is more often than not used by web developers, as opposed to other pieces of Catchpoint which are used by site reliability engineers.

So, again, we mentioned before, which metrics people care about is going to depend on who they are and what stage of the software development life cycle they're in. And so what WebPageTest does is measure all of the relevant web performance information.

For this particular GitHub Action, I have it doing something pretty simple: measure these four metrics because they're important and make sure that they are all below the Google recommended thresholds. Cool.

So that works. It's wonderful.

This particular commit, the unit tests passed, the WebPageTest run passed, and so it actually goes and creates an automatic pull request for the production environment, which then a human can take a look at and merge to deploy to production. But what happens if the WebPageTest run doesn't pass? So one of those metrics is Largest Contentful Paint.

And as you can imagine, with this demo app, this image is actually the largest contentful paint element. So whenever that image comes up, that's the time in the timeline that LCP occurs.

So what I did, just for the demo, is I delayed the loading of that image by two and a half seconds. And it just so happens that Google's recommendation for LCP is two and a half seconds.

And so in this case, of course, by delaying it, LCP happened at about three and a half seconds. And so this pipeline automatically failed and said, hey.

You know what? You are not passing the WebPageTest KPI requirements that are part of this development process, and so it won't automatically create a pull request to push it to production. So that's the development half of the OSDLC.
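
To make that pass/fail step concrete, here is a minimal sketch of a Core Web Vitals budget gate of the kind the pipeline runs. The LCP threshold of 2.5 seconds is Google's published recommendation; the other three metrics, their thresholds, and the measured values are placeholders, since the talk doesn't spell out exactly which four metrics the demo checks.

```typescript
// cwv-gate.ts -- fail the build when any measured web performance metric
// misses its budget (the same idea the demo pipeline uses).

interface WebVitals {
  lcpMs: number;   // Largest Contentful Paint
  cls: number;     // Cumulative Layout Shift (unitless)
  tbtMs: number;   // Total Blocking Time (lab proxy for interactivity)
  ttfbMs: number;  // Time to First Byte
}

// Budgets: LCP follows Google's 2.5 s recommendation; the rest are placeholders.
const THRESHOLDS: WebVitals = { lcpMs: 2500, cls: 0.1, tbtMs: 200, ttfbMs: 800 };

function checkBudget(measured: WebVitals): string[] {
  return (Object.keys(THRESHOLDS) as (keyof WebVitals)[])
    .filter(metric => measured[metric] > THRESHOLDS[metric])
    .map(metric => `${metric}: ${measured[metric]} exceeds budget ${THRESHOLDS[metric]}`);
}

// Example: the delayed hero image pushes LCP to ~3.5 s, so the gate fails.
const measured: WebVitals = { lcpMs: 3500, cls: 0.02, tbtMs: 90, ttfbMs: 310 };
const violations = checkBudget(measured);

if (violations.length > 0) {
  violations.forEach(v => console.error(`Performance budget violation -> ${v}`));
  process.exit(1); // block the automatic pull request to production
}
console.log('All metrics within budget.');
```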

The second half is the site reliability half. Right? So now we take an application that's been tested.

Everything is working wonderfully. Let's deploy to production.

Oh, and by the way, let's create these templated Catchpoint scheduled tests so that we know, while the application is running in production, Catchpoint is watching it to make sure that everything is working properly. And so same thing, integrated into GitHub Actions.

And in this case, it checks to see if the tests already exist, because if a new application is developed, we need to create new tests. This application already existed, so it skipped the create-new-tests step.

Alright. And so here is what gets created.

So this is just my personal philosophy. Of course, different people do different things.

But for a web application, I like to have at least a quick HTTP web-based test that's just running very often from different locations where I expect users to be. It also runs a traceroute, just in case there's a problem, to see if the problem is network related.

Then I want to make sure I test any of the important dependencies. In this case, DNS, which, as we talked about in the stack map, is an important dependency.

And then I want to do a less frequent but more in-depth end-to-end test of the application. And so I'll show what that is in a second.
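
As a sketch, that monitoring philosophy could be captured as a small monitoring-as-code definition that the deploy job feeds to whatever API creates the scheduled tests. The schema and values below are invented for illustration; they are not Catchpoint's actual test-creation payload.

```typescript
// monitor-suite.ts -- a declarative description of the scheduled tests the
// deploy step should create (or skip, if they already exist).
// The shape of these objects is hypothetical.

type MonitorType = 'http' | 'dns' | 'transaction';

interface MonitorSpec {
  name: string;
  type: MonitorType;
  target: string;
  frequencyMinutes: number;
  locations: string[];
  runTraceroute?: boolean;   // capture a traceroute when the test fails
}

export const demoAppMonitors: MonitorSpec[] = [
  // Quick, frequent HTTP check from wherever users are expected to be.
  { name: 'demo-app-http', type: 'http', target: 'https://demo.example.com',
    frequencyMinutes: 5, locations: ['New York', 'San Francisco', 'Tokyo'], runTraceroute: true },

  // The important dependency from the stack map: DNS resolution.
  { name: 'demo-app-dns', type: 'dns', target: 'demo.example.com',
    frequencyMinutes: 15, locations: ['New York', 'San Francisco', 'Tokyo'] },

  // Less frequent, more in-depth end-to-end journey (Playwright script).
  { name: 'demo-app-journey', type: 'transaction', target: 'https://demo.example.com',
    frequencyMinutes: 60, locations: ['New York', 'Tokyo'] },
];
```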

So here we go. So this is the quick HTTP test.

Everything is working wonderfully. No problem.

It kind of gives you an overview. I won't go through the details.

That's a more in-depth Catchpoint demo, but this is the quick test. This is the DNS test.

Right? So I'm showing different screens here. But, for example, this one is a scatter plot.

And what's interesting about this is you see the data is kind of spread out. So one thing we can do maybe is actually break it down and start analyzing what's going on here.

Hey. It looks like, interestingly enough, the Tokyo DNS experience is actually better than the US DNS experience from New York and from San Francisco.

Right? So you would think it'd be the other way around. But it just so happens that for this particular domain, it's actually faster.

And, of course, in this case, five to fifteen milliseconds or five to twenty milliseconds, it's not a huge difference in either case, but it can be a very big difference.

Leo Vasiliou

27:13 - 27:38

Well, it can be a very big difference when you consider a typical user journey, which probably has hundreds and hundreds and hundreds of individual request components, each with their own individual time.

So when you add it up, that 20 milliseconds all of a sudden is, you know, two, three, four full seconds. So... Absolutely.

And this...

Sergey Katsev

27:38 - 34:35

is where that site reliability engineering comes in. Right? So the application itself is working fine in this case, but can we improve the performance? Or can we notice where there might be some hot spots and then focus on those hot spots? And so these breakdown tools are one way to do that.

So skipping along, I mentioned there that we set up this more in-depth test. Well, the more in-depth test is actually launching, in this case, Playwright, which launches Google Chrome and clicks on the button.

Right? So that's the only user journey for this particular demo application. But you see here, okay, at the very beginning in step one, you see our demo app and the button has not been clicked.

And then when I click, on the third step here, hey. Look.

The screenshot shows that it has in fact been clicked. And you can, of course, measure all of the performance along the way and break it down and say, hey.

Is just the first part working properly? Is the interaction with the back end working properly? Really dig into what the most common flows of your users might be. So that is how you set up the OSDLC in the first place.
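
For reference, the user journey described here maps to a very small Playwright test along these lines. This is a sketch: the URL, selectors, and expected message text are assumptions about the demo app.

```typescript
// demo-journey.spec.ts -- the one user journey in the demo app:
// load the page, click "call back end", confirm the back end responded.
import { test, expect } from '@playwright/test';

test('front end can call the back end', async ({ page }) => {
  await page.goto('https://demo.example.com'); // placeholder URL

  // Step 1: the page has rendered and the button has not been clicked yet.
  await expect(page.getByRole('button', { name: /call back end/i })).toBeVisible();

  // Step 2: click the button, which triggers the request to the back end.
  await page.getByRole('button', { name: /call back end/i }).click();

  // Step 3: the message returned by the back end appears on screen.
  await expect(page.getByText(/hello from the back end/i)).toBeVisible(); // assumed message text

  // A screenshot after the click makes it easy to verify the journey visually.
  await page.screenshot({ path: 'journey-after-click.png' });
});
```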

Right? So just to super quickly reiterate: you have your, in this case, GitHub Actions.

Could be Jenkins. It could be anything else that is measuring the KPIs that are important to your business. For developers, I said it's the Core Web Vitals, measured through WebPageTest.

For site reliability, I said, okay. We care about DNS.

We care about web performance. We care about the user journeys, and those tests got set up using Catchpoint's APIs.

So now let's break things, if you're okay with that, Leo. So I am okay with that.

So for chaos, what I figured is, okay. Well, you should always have a baseline performance.

And then you can do, of course, many more things, but I thought it would be interesting to test these, four different scenarios. So let's just walk through them.

So the first scenario: I had it in the stack map, but I didn't talk about it, there's a CDN. Right? So we're using the Cloudflare CDN.

So does it actually improve performance, in particular, for different users? So I'm running the service in AWS on the East Coast.

I'm located in New York. So I said, hey.

Let's also test from New York. Let's test from San Francisco.

And maybe wishful thinking, but maybe we have some users in Tokyo. And so we see here that, hey.

The baseline: the people in New York, as you can expect, are having a relatively good experience, and it's pretty comparable to the experience for the CDN. But if we look at San Francisco, okay, it's getting worse.

And if we look at Tokyo, as you could expect, with the distance from the origin server, it's getting much worse. And so, in fact, the CDN does, as you can imagine, have a massive performance improvement.

So then let's look at bandwidth. So nowadays, most of our users have pretty fast Internet access.

What happens if they don't? Do you care? Maybe you don't, and then this is not an experiment that you would run. But if you have users accessing maybe an old DSL link, or maybe even like a 3G or 4G cell phone service, to use your website, you can throttle the bandwidth accessing the website, and then you can take a look and see what happens.

Right? So, again, you see that, hey. If the bandwidth is throttled, yeah, performance is slower.

But what is interesting is it still works. Right? The site still loads.

It's slower, but it still loads. And you see that, hey.

Look. There's my application.
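
A sketch of how that kind of bandwidth experiment can be scripted with Playwright and the Chrome DevTools Protocol, without touching the production environment. The URL is a placeholder and the throughput numbers only approximate a slow 3G-class connection.

```typescript
// throttled-load.ts -- load the page through an emulated slow connection
// (Chromium only, via the Chrome DevTools Protocol).
import { chromium } from 'playwright';

async function main(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Attach a CDP session and emulate a slow, high-latency link.
  const cdp = await page.context().newCDPSession(page);
  await cdp.send('Network.emulateNetworkConditions', {
    offline: false,
    latency: 400,                          // ms of added round-trip latency
    downloadThroughput: (400 * 1024) / 8,  // ~400 kbit/s down
    uploadThroughput: (200 * 1024) / 8,    // ~200 kbit/s up
  });

  const start = Date.now();
  await page.goto('https://demo.example.com', { waitUntil: 'load' }); // placeholder URL
  console.log(`Page loaded over throttled link in ${Date.now() - start} ms`);

  await browser.close();
}

main().catch(err => { console.error(err); process.exit(1); });
```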

So the next thing I said is, hey. Remember this big dependency? So just to kind of remind people, here's our front end service.

It has a big dependency on Webflow to pull in that giant image. Right? The hero image in the application, which is this guy over here.

Well, what happens if that dependency doesn't work? So what happens if Webflow, in this case, is down, or the Catchpoint website is down? What's going to happen to my application? So a couple of things. First of all, we see that, kind of as expected, the number of bytes downloaded is way, way, way less if we block the request.

Maybe more interestingly, the actual amount of time that it takes for the website to load is pretty much the same. Cool.

So that means it's working. Right? Well, whoops.

Sorry, the wrong... oh, yeah.

So take a look at what actually happens. So it failed to load the image, which you would kind of expect, but none of the rest of the application loaded either.

So, again, it's important to test these dependencies, right, to make sure that, hey, if this thing goes down, what's the result? Does my application still work? And, of course, in this case, on purpose, I used JavaScript to not load the rest of the application until the image appears.

Not a great practice as a developer, but hopefully an effective demo. And the last example that I have.

So has anybody here, and I'll ask this as a rhetorical question, ever had their, maybe cough cough marketing team want to put an amazing, beautiful, high resolution logo, photo, whatever in their application or on their website without even realizing, hey. I'm putting this 10 meg file where it used to be a one meg file.

Leo Vasiliou

34:35 - 34:39

Well, Serge, yes. But only because it was...

Sergey Katsev

34:39 - 35:47

a picture of you. So, I mean...

Why, thank you. I look amazing in high def, as does everyone else, of course.

And so let's simulate that. Right? And so here's the normal image.

The normal image is about 300 k. Well, let's replace it with a 10 meg image, which I'm actually pulling from Wikipedia instead of from the Catchpoint website.

And so let's see what happens. So interestingly enough, the page loads just fine.

Right? So no issues there. It's still rendered, but let's do a records compare across the two of them.

So you see there's the baseline, which is the primary run, and you see the request override. Look how much slower the application became when we added this giant image.

Right? So nothing. Nothing.

Nothing. It used to be done loading at about eight hundred milliseconds.

Now it's not done loading until two point two seconds.
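
Both of these last two experiments, blocking the Webflow dependency and swapping in a much heavier hero image, can also be reproduced with request interception in a Playwright script, again without changing production. A sketch; the URLs and route patterns are placeholders.

```typescript
// chaos-routes.ts -- two request-level chaos experiments against the demo app.
import { chromium, type Page } from 'playwright';

const APP_URL = 'https://demo.example.com';                   // placeholder
const HERO_IMAGE_PATTERN = '**/assets.example.com/**';        // placeholder for the Webflow-hosted image
const HEAVY_IMAGE_URL = 'https://demo.example.com/10mb.jpg';  // placeholder

// Experiment 1: what happens if the third-party image dependency is down?
async function blockDependency(page: Page): Promise<void> {
  await page.route(HERO_IMAGE_PATTERN, route => route.abort('failed'));
  await page.goto(APP_URL);
}

// Experiment 2: what happens if someone ships a 10 MB hero image?
async function overrideWithHeavyImage(page: Page): Promise<void> {
  await page.route(HERO_IMAGE_PATTERN, route => route.continue({ url: HEAVY_IMAGE_URL }));
  await page.goto(APP_URL);
}

async function main(): Promise<void> {
  const browser = await chromium.launch();
  for (const experiment of [blockDependency, overrideWithHeavyImage]) {
    const page = await browser.newPage();
    const start = Date.now();
    await experiment(page);
    console.log(`${experiment.name}: page settled in ${Date.now() - start} ms`);
    await page.close();
  }
  await browser.close();
}

main().catch(err => { console.error(err); process.exit(1); });
```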

Leo Vasiliou

35:47 - 35:53

So three times as much. And that's a, moment.

Yeah. Exactly.

Sergey Katsev

35:53 - 36:38

And, of course, the goal here is to kind of hopefully get people to think about, oh, that's cool. What else can I test? And remember, for none of these tests did we actually change the production environment.

I guess you can kind of say for the CDN test we did, because we actually tested two different things, one with the CDN, one without. But for everything else, we used various overrides or various blocks or various filters to change the bandwidth, to change the image, or to block dependencies, and then see what happens with the application.

So that's what I have.

Leo Vasiliou

36:38 - 43:24

Alright. Serge, status first.

Thank you so, so much. So one moment while I get back to sharing the slides.

So let's go back here and see if it picks up where we left off. Yes, sir.

Go ahead. Alright.

So demo of capabilities. What comes next? So let me just kind of park this image up here for a second.

So hopefully, hopefully, when the AI summary occurs, it'll mark this point in time right here as, like, you know, one of the important moments. One is that uptime and performance.

Right? You don't need to walk into the server room or walk around, pull a drive out, and wonder what happens. The second is the user conditions.

So when you're in your CI and you're testing from your desktop or wherever you said you were testing from, right, it's going to be a completely different set of data than when you inject user conditions.

So we saw the demo. It was great.

Thank you very much. We had a couple of moments.

And so we wanna make sure we're talking about the entire life cycle. Right? The continuously making it better cycle.

It's not a one and done, and it's not a matter of if, it's a matter of when there will be incidents. So, just pause, take a moment, think through your entire tool chain, think through your entire productivity stack.

It's all important. It should all be part of the conversation.

Your people, your processes, and, of course, the tech, and we only showed a handful. Think through the dependencies, make sure to incorporate them into this flow.

What happens if an entire portion of your team in a particular part of the world cannot work because of bad weather? Right? Organizational-level resilience, and so on and so forth. Yep.

Excuse me. And then the idea again of not necessarily having to choose between shifting left, catching issues when they're cheaper to fix, versus, you know, shifting right, monitoring the actual user experience, but shifting wide, because we don't actually have to inject real failures.

We can simulate what those failures will look like and then look at the data. Now, what comes next is kind of the closing thoughts here before we wrap it up.

So if you look at this x axis, this is a long term trend. Right? Looks like it's a few months.

So what I wanna say here is, let's say, you know, day one, you're checking your stats and hypothetically it says 95 milliseconds, 100 milliseconds, 97 milliseconds. Day two, 96 milliseconds, 101 milliseconds, 98 milliseconds.

Day three, 97, 102, 99. Those daily results may be deceptively within your range.

And eventually, yes, you might get that pop-up saying, outside of Google's recommendations. Yeah.

But when you look at the long term, and we might, you know, also incorporate the word regression here, we can see that that slow creep is not something you're going to see without the continual piece. Right? The SRE piece.
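
One way to catch that slow creep programmatically is to fit a simple trend line over the longer window instead of only checking each day against the absolute threshold. A minimal sketch with made-up daily p95 values.

```typescript
// trend-check.ts -- flag a slow regression even while every daily value
// still sits inside the absolute budget.
function linearSlope(values: number[]): number {
  const n = values.length;
  const xs = values.map((_, i) => i);
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  const cov = xs.reduce((acc, x, i) => acc + (x - meanX) * (values[i] - meanY), 0);
  const varX = xs.reduce((acc, x) => acc + (x - meanX) ** 2, 0);
  return cov / varX; // least-squares slope: ms of drift per day
}

// Hypothetical daily p95 response times (ms) over a couple of weeks.
const dailyP95 = [95, 96, 97, 97, 98, 99, 100, 101, 101, 102, 104, 105, 107, 108];

const slope = linearSlope(dailyP95);
if (slope > 0.5) {
  console.warn(`Slow creep detected: p95 is drifting up ~${slope.toFixed(2)} ms/day`);
}
```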

And what we wanna say here is that this gets back to what I was saying about the real world, where you have to take external conditions, these wireless users, mobiles, etcetera, into account: the performance that you're seeing is not going to be so clustered. And when you do your continual measurements, averages lie, so we must make sure to look at performance as a distribution.

So here, for example, we're looking at a bunch of different percentiles, but, could be histograms, could be cumulative. Maybe think of a group of people, kids in a classroom, people on a sports team, and line them up based on height.

Say, oh, what's the average height? I don't know. You know, five five.

Right? There's going to be people who are taller, people who are shorter. Right? Same thing with performance.

The average is broad. There's always gonna be faster.

There's always gonna be slower. And every percentile, looks different.

And different things will affect different percentiles. And I wanna slow down for a second here and say, if there's a blip, blip being the super technical term for a micro outage, right, very short duration, that might move the ninety-fifth percentile, that is to say, affect only 5% of your users very quickly, versus, say, a full release where a heavier payload is shipped.

Maybe that 300 k image all of a sudden goes to 10 meg by accident, like you were showing. That might move, you know, the fiftieth percentile, or the median, that is, half of your user base is affected, because they're on their mobile, you know, in a crappy Starbucks cafe or something like that. And then one more quick example, again, just to kinda go back to that slow creep trend.

So we have a long term trend that looks like this. Notice that crazy spike, on the end.

And if we kind of go back to that concept of Internet Performance Monitoring and break it down by a view that looks like this, we can break this down and talk through it. So Asia looks like it is in fact creeping up.

Europe, very much less so. And then North America, yeah, it definitely has increased, and then there is that crazy spike that we saw on that previous chart, which was an aggregate of these breakdowns, and we can see, hey, I really messed up something in North America.

So that kinda brings us to the other dimension. You know, is it regional slash localized versus global? Right? So micro, short-duration incidents versus prolonged, sustained incidents is one dimension.

Global versus regional or local is the other dimension.
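
Since averages lie, the breakdown described here boils down to computing percentiles per dimension, region in this example, rather than one global mean. A sketch using synthetic sample data.

```typescript
// percentiles-by-region.ts -- look at the distribution per region instead of
// a single global average.
interface Sample { region: string; responseTimeMs: number; }

function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function summarize(samples: Sample[]): void {
  const byRegion = new Map<string, number[]>();
  for (const s of samples) {
    const bucket = byRegion.get(s.region) ?? [];
    bucket.push(s.responseTimeMs);
    byRegion.set(s.region, bucket);
  }
  for (const [region, times] of byRegion) {
    const sorted = [...times].sort((a, b) => a - b);
    console.log(
      `${region}: p50=${percentile(sorted, 50)}ms  p95=${percentile(sorted, 95)}ms  p99=${percentile(sorted, 99)}ms`,
    );
  }
}

// Synthetic data: a spike confined to North America moves that region's high
// percentiles while leaving Europe and Asia untouched.
const samples: Sample[] = [
  ...Array.from({ length: 100 }, () => ({ region: 'North America', responseTimeMs: 120 + Math.random() * 400 })),
  ...Array.from({ length: 100 }, () => ({ region: 'Europe', responseTimeMs: 140 + Math.random() * 60 })),
  ...Array.from({ length: 100 }, () => ({ region: 'Asia', responseTimeMs: 180 + Math.random() * 80 })),
];
summarize(samples);
```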

Sergey Katsev

43:24 - 43:57

And the goal here is to collect all of these dimensions, these cardinalities, before you need them. Right? Because once an incident is happening, you no longer have the baseline.

You don't know what to compare to to see if things got worse. But collecting all of them the whole time means you can pull up data like this and say, hey.

Look. There's a spike.

But wait. The spike only impacted North America.

It didn't impact anything else. Therefore, uh-huh.

The root cause must have to do with North America. Let's look at.

Leo Vasiliou

43:57 - 45:31

the drill downs, go into the stack map, click on the individual components. Exactly.

Alright. That was the, official last slide.

So we're just gonna offer some closing thoughts, get ready to wrap this up, see if there are any other questions or any questions that come up. So first, thank you again for your precious time today.

We talked about the key concepts to make sure we're on the same page, saw a demo of the capabilities, and then talked about some of the what comes next. Right? It's not a one and done.

We'll go back to what we were saying earlier. That first slide, or the second slide, technically, I guess, why are we doing this: like, if there are problems, if there are things, different folks might react differently and be worried about different things, but they have the same goal, or they should have the same goal.

IT and the business should be partners. And speaking of business, I always like to say that capabilities are the gateway to business outcomes.

So if you've got your tech speeds and feeds on the extreme left side, if you will, and your business outcomes, your business goals, your revenue on the extreme right side, if you will, talking about the capabilities is what is going to help make those IT to business conversations be easier. In order to do this, we need the ability to dot dot dot.

Jared H

45:31 - 45:32

Yep.

Leo Vasiliou

45:32 - 46:39

And then happens. Yep.

Analytics. Data becomes information, becomes knowledge, becomes wisdom.

So again, the data, right, your individual test runs, but also across the long term, to do things like check for regressions across different dimensions with different levels of cardinality. Basically, what's important to you.

We didn't talk about this too much, except when we said the idea of shifting wide. So make sure to standardize your telemetry across the life cycle so people aren't measuring in feet and inches in pre-prod but using the metric system in production.

Right? We don't want different people responding to different things where they think one is noise, but it's, actually signal to someone else, or vice versa. Even simple is complex.

That crazy application, Serge, right, for the sake of demonstrating, that architecture that we talked through in the stack map, very real.

Jared H

46:39 - 46:39

Yep.

Sergey Katsev

46:39 - 46:52

And I really tried to keep it simple. That was my goal.

Keep it to two or three boxes. And then as I sat there thinking about, okay, what could go wrong? The boxes just kept coming.

Leo Vasiliou

46:52 - 47:48

And then slow is the new down. So please remember, you don't have to inject complete failures.

You can inject, simulate, performance degradations as well. Does it move the ninety-ninth percentile, the ninety-fifth, etcetera? Right? Slow is the new down.

And another way of saying that is, yeah, bad performance is equally as bad as downtime, but you can test it without being fully down. And then again, shift wide, standardize telemetry.

You don't have to worry about the debate of shifting left, shifting right. Right? Shift wide.

Use IPM to shift wide.

Sergey Katsev

47:48 - 48:04

Well, and specifically, you can shift right now and test in production without actually breaking production. That's the goal.

Leo Vasiliou

48:04 - 48:22

Alrighty, Serge. That is all I got.

You know, so, again, I'll say thank you to TechStrong. Jared, if you wanna bring us home, wrap us up.

Jared H

48:22 - 49:33

Yeah. Sorry.

We've got some weird virtual background stuff going on here, don't we? And my camera settings are not any better. Well, I'll go ahead and close out.

I apologize, everyone. Serge, Leo, I wanna thank you for taking the time to join us.

It has been a pleasure having you. I enjoyed the demo, your slide deck.

That was amazing. Before we do close out, I wanna remind everyone, you do have some handouts in the resource section on the left side of your screen.

We also have a survey in the resource area. If you wanna go ahead and click that, give us your input.

It's invaluable to us. Your feedback does go a long way.

You can, you know, share any of your closing thoughts with Leo and Sergey. Tell us what you wanna hear from us on our next webinar.

Again, it goes a long way. It helps us out here at TechStrong.

It helps out the Catchpoint team. And, lastly, this program was recorded.

And just to remind you, you will receive a link in your inbox shortly after we conclude today's session. Thank you for joining this TechStrong learning experience.

We do look forward to seeing you on our next program. Have a great day, everyone.

Thank you.