At Catchpoint, my role can be summarized at a high level as two halves: designing and taking care of Engineering teams – and working with those teams to design and take care of the various distributed systems that run our platform.
I recently attended Sapphire Ventures' Hypergrowth Engineering Summit (thank you David Carter and Sapphire for the invitation!), where the sessions focused on creating and scaling high-functioning engineering. Some of the sessions focused on scaling engineering teams, others on systems. What I found especially interesting was that although the topics were very different, they fit together to tell a cohesive story – and the advice for teams works just as well for systems, and vice versa.
Three key talks
All of the talks were great – but I wanted to share my thoughts about my three favorites.
- Dave Rensin, SVP Engineering at Pendo, talked about how the human brain works, and how understanding that helps engineering leaders drive results. The main premise (I’m clearly skipping quite a bit) is that the human brain's primary function is to make models of the future and then validate those models. A mismatch between the model and reality causes anxiety. When humans are anxious, we have trouble making decisions, or we make bad ones.
- J. Paul Reed (until recently, Senior Applied Resilience Engineer at Netflix, who gave an excellent talk at our SRE from home event a couple of years ago) spoke about Engineering Resilience. Essentially, he stressed that resilience is a consumable resource - one that can run out, but that can also be refilled. Teams need to practice building resilience in various situations in order to be high-functioning, he explained. Resilience is one of the critical "Rs", along with Redundancy and Reliability - and having systems that are redundant and reliable helps teams spend their resilience meaningfully so that they don’t run out of it.
- Panelists Rob Zuber, CTO at CircleCI, and Jonathan Nolen, SVP of Engineering and Product at LaunchDarkly, talked about shifting right - and how maybe it's not so scary to release issues to production if you're confident that they can be resolved very quickly. To be confident of that, you need observability of your entire system. I have the most thoughts on this session, so I wrote a separate blog post about it.
Building successful engineering teams
Combining the themes of these sessions (along with several of the others) and applying them to successful engineering teams, the narrative is easy to understand but hard to implement: our job as engineering leaders is to reduce the level of anxiety that our team members experience so that they can excel in their roles. One way to reduce anxiety is to build resilient teams, and one way to improve a team's resilience is to make sure the team doesn't need to spend it on unnecessary things - so design your systems in a way that avoids unnecessary emergencies. When you do have emergencies (which, after all, are inevitable), make sure you have the data you need to identify the problem as quickly as possible. For systems, this can be achieved with a proper observability strategy.
What happens when you flip the advice?
Interestingly, all of the advice above about teams and systems can be flipped, and it's just as valid and valuable. When a system's model doesn't match reality, the system makes the wrong decisions. It might go down the wrong branch of an if statement, for example - or in an extreme instance, a self-driving car might crash. Systems need to be resilient against model mismatches - and teams should always avoid single points of failure. With teams, just like with systems, it goes back to end-to-end observability: collect as much data as possible about how your teams are doing and use this data to identify and solve problems.
Of course, it's usually harder to implement these strategies for teams than it is for systems.
Systems are generally predictable: they do what people tell them to do, and even when they break, there's a thread of evidence as to what went wrong. Teams can fail in much more unpredictable ways - and it can be very hard to figure out exactly what went wrong. This is why we need to collect as much data as possible, earn as much trust with our teams as possible, and then do everything possible to help them succeed. And just like in system observability, the more vantage points we have on how a team is doing, and the more insights we have into how to help it thrive (whether from measured data or inspiration from other engineering leaders), the more likely we are to succeed.