As a monitoring and observability company, we have a lot of monitoring built into our own systems. We have the standard monitoring to make sure that systems are performing properly, data is flowing through our infrastructure, and so on. At the same time, we monitor for any sudden changes to the tests that our customers are running.
On September 29, 2021, 19:21:40 UTC, we started to see a tsunami of alerts at Catchpoint. They originated from some of our web tests from our synthetic nodes, occurring when our Let’s Encrypt “R3” certificate expired. These types of incidents are pretty rare.
Another example happened in 2020 when the Sectigo AddTrust root certificate expired. The difference with this event was that a lot more servers rely on Let’s Encrypt certificates.
The root cause of the crisis was not Catchpoint, our product, or any employee – but an issue with changes to the certificate path by a certificate issuer. Furthermore, as we work with many vendors, we’ve received updates from some which indicate that solving this problem is as easy as downloading the latest OS updates. While this is true for some, it does not solve the problem in general!
Below we explain why, and how to solve it on the server-side so that all of your clients can access your web service without issues. We also share our incident review of the event, so that the learnings will help others.
Do You Trust Let’s Encrypt?
Before we get into the weeds, I just want to say that I, personally, trust Let’s Encrypt. They’re a great company that’s made certificate management extremely accessible to everyone, and extremely developer-friendly.
Partially because of them, the number of websites using encryption has skyrocketed in recent years. Encryption is extremely important on the Internet. It’s the basis for secure communications. Whether you’re checking your bank balance, buying a new pair of socks from an e-tailer, or talking to your friends, you do so with the assumption that this transaction is secure.
In the last ~8 years (2013-2020), the percentage of web pages using HTTPS has gone from 25% to over 84%! There are several reasons for this incredible growth. One of them is that Google has been gradually pushing sites to use HTTPS by marking HTTP-based sites as “not secure.” Still, no matter the cause, in that amount of time Let’s Encrypt has gone from issuing certificates for about 50M websites to over 230M!
At the same time, it doesn’t matter that I trust Let’s Encrypt if computers don’t. That’s exactly what happened on the evening of September 29 and again on the morning of September 30, 2021.
How Digital Certificate Trust Works
Here’s a quick summary of how certificate trust works on the Internet:
- Root certificates: There are a handful of Root certificates. These are issued by major companies under a lot of scrutiny and are installed in the Certificate Stores of computers worldwide by the company that developed and maintains the OS. If you have a computer that can connect to an HTTPS website, you have such a certificate store. At this moment, the macOS laptop I’m writing this on has 161 “System Root certificates” installed.
- Chain of Trust: When someone launches a website nowadays, they must support HTTPS. Therefore, they purchase a certificate from a provider. There are many providers to choose from. Some have their own Root certificate and others have a Certificate Authority certificate, which was signed by one of the Root certificates or by another Certificate Authority. In this way, there is a chain of trust from the website’s certificate all the way to the root certificate.
- Intermediate certificates: When you go to the HTTPS website, the server hosting the website sends the certificate during the SSL handshake with the client (browser/HTTP client). It might also send you one or more intermediate certificates. These intermediate certificates are what it thinks you might need to connect the chain of trust from the server certificate to one of the root certificates you installed on your computer.
- Validation process: Your browser “walks” the chain of trust, from the server certificate up to the root. If it makes it all the way to the root and finds it in its “store,” the chain is validated, and the connection is allowed to proceed. Otherwise, you get a Security Warning like the one below.
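The validation walk described above can be sketched in a few lines of Python. This is a toy model, not how any real TLS library is implemented: it only compares names and expiry dates and skips signature checks entirely. The `Cert` class, the `walk_chain` function, the hostname, and the specific dates are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Cert:
    subject: str         # who the certificate identifies
    issuer: str          # who signed it
    not_after: datetime  # expiration time


def walk_chain(leaf, intermediates, trust_store, now):
    """Return the subjects from the leaf up to a trusted root, or raise ValueError."""
    by_subject = {c.subject: c for c in intermediates}
    path = []
    current = leaf
    while True:
        if now > current.not_after:
            raise ValueError(f"{current.subject} is expired")
        path.append(current.subject)
        # Success: the issuer is a root already present in the local store.
        root = trust_store.get(current.issuer)
        if root is not None:
            if now > root.not_after:
                raise ValueError(f"{root.subject} is expired")
            path.append(root.subject)
            return path
        # Otherwise climb to an intermediate that the server sent.
        nxt = by_subject.get(current.issuer)
        if nxt is None or nxt.subject in path:
            raise ValueError(f"no trusted issuer found for {current.subject}")
        current = nxt


# Illustrative data only; the dates are placeholders, not real certificate values.
UTC = timezone.utc
ISRG_ROOT = Cert("ISRG Root X1", "ISRG Root X1", datetime(2035, 6, 4, tzinfo=UTC))
R3 = Cert("R3", "ISRG Root X1", datetime(2025, 9, 15, tzinfo=UTC))
LEAF = Cert("www.example.com", "R3", datetime(2021, 12, 1, tzinfo=UTC))
TRUST_STORE = {ISRG_ROOT.subject: ISRG_ROOT}
NOW = datetime(2021, 10, 1, tzinfo=UTC)

print(walk_chain(LEAF, [R3], TRUST_STORE, NOW))
```

If any certificate on the path is past its `not_after` date, or the walk runs out of issuers before reaching the store, the function raises, which corresponds to the browser showing a security warning.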
R3 Certificate Expiry And The Chain Of Trust…
Let’s go through the details of what happened:
As mentioned, the first problems we saw with web tests from our synthetic nodes began at 19:21:40 UTC on September 29 when the Let’s Encrypt “R3” certificate expired. Here’s the certificate information for this intermediate certificate:
The most important piece of information here is the expiration date. Since this is an intermediate certificate, this means that Let’s Encrypt used this certificate to sign other certificates for their customers. For example:
The screenshot shows a web site whose certificate was signed with this R3 certificate. As soon as the certificate expired, this website was no longer accessible! A browser which tried to validate the website’s certificate would walk the chain of trust and find that the intermediate certificate is expired. That’s when you get scary looking errors such as this one in your browser:
Let’s Encrypt published a new R3 certificate! The new expiration date is in 2025 – plenty far away. Everything’s great, right? Well, not quite.
It turns out that the certificate needs to be updated on your computer – usually through a Windows or macOS update – before it’ll work. A lot of people don’t update their computers as often as they should, though. Even worse, a lot of embedded devices rarely if ever update their certificates! Someone here at Catchpoint mentioned that his kids couldn’t watch their favorite streaming show on his smart TV because of this issue!
So, OK – let’s say you got the latest R3 intermediate certificate installed in your whole fleet of devices. Now you can go to any of those millions of sites with Let’s Encrypt certificates, right? Well, sort of...until 10AM Eastern the following morning.
September 30th, 10AM EDT: DST Root CA X3 Certificate Expiry And The Consequences
At 10AM on September 30, the DST Root CA X3 certificate expired. The details are a little confusing, but bear with me.
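To make the timing concrete, here is a small Python sketch that converts the two expiry timestamps into US Eastern time. The R3 value matches the alert time mentioned earlier. The DST Root CA X3 value is assumed here to be that certificate’s notAfter field, so check it against the certificate itself. The fixed UTC-4 offset is also an assumption: it relies on daylight time being in effect on those dates.

```python
import ssl
from datetime import datetime, timedelta, timezone

# notAfter strings in the text format OpenSSL prints (assumed values, see above).
R3_NOT_AFTER = "Sep 29 19:21:40 2021 GMT"
DST_ROOT_NOT_AFTER = "Sep 30 14:01:15 2021 GMT"

# Fixed offset for Eastern Daylight Time (assumption: DST in effect).
EDT = timezone(timedelta(hours=-4), "EDT")


def to_local(not_after, tz):
    """Convert a certificate notAfter string to an aware datetime in tz."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz)


r3_expiry = to_local(R3_NOT_AFTER, EDT)
dst_expiry = to_local(DST_ROOT_NOT_AFTER, EDT)

print("R3 expired:       ", r3_expiry.isoformat())
print("DST root expired: ", dst_expiry.isoformat())
print("Time between them:", dst_expiry - r3_expiry)
```

Under these assumptions the two expirations are less than a day apart, which is why fixing only the first problem bought so little time.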
Originally, the DST Root CA X3 was used to sign all Let’s Encrypt certificates (including the R3 intermediate certificate above). Let’s Encrypt also cross-signed the certificates using their own ISRG Root X1 certificate. This was done because the DST certificate was already present in most browsers and devices. However, the ISRG certificate was not.
As Let’s Encrypt became more well-known and the ISRG certificate was available in all major devices, they stopped relying on the DST certificate.
Here’s Let’s Encrypt’s diagram of their certificate hierarchy:
Note that any server certificates (“Subscriber Certificates”) that were signed by R3 were signed by either DST root or ISRG root, or, most likely, cross-signed by both.
Here’s the DST root certificate’s information: https://crt.sh/?id=8395
What you’ll see is that there are actually two versions of the ISRG Root X1 certificate: https://crt.sh/?id=3958242236 and https://crt.sh/?id=9314791. The second one is self-signed. This is fine for a Root CA certificate which is present in most devices around the world. The first one, though, is signed by DST Root CA X3!
When the DST root certificate expired, this caused problems for two classes of systems:
- Systems which didn’t have an updated copy of the ISRG Root X1 certificate started failing to connect to sites using Let’s Encrypt because their site certificate was signed by R3, which was signed by ISRG, which was signed by DST – which was expired!
- Systems which did have the proper updated copy of the ISRG Root X1 certificate but wanted to validate the DST Root certificate anyway, because it cross-signed the R3 certificate, also started failing.
The first category was relatively easy to fix: update the OS or download the new certificate and install it, assuming it’s not an embedded device that hasn’t issued an update.
The second one is harder. For example, any software which relies on OpenSSL 1.0.2 or earlier will have this problem – and there’s no way for the client to fix it.
Think of it this way: usually, your website sends the server certificate and any intermediate certificates that the client might need. As the client walks the chain of certificates, it sees the site certificate signed by R3. Then it sees that R3 is signed by ISRG, but also by DST. Some browsers and other HTTP clients only validate one path: they find that the ISRG certificate is valid and they’re satisfied. Others need both to be valid, and they’re not, because the DST certificate is expired!
If you run a website and want customers to be able to connect from devices such as these, there’s only one fix: Regenerate the certificate that your site uses so that it is no longer cross-signed by the DST certificate. Then your server will stop sending it to the client to validate, and the client won’t fail to validate it.
This is particularly important if you have HTTPS-based services that aren’t being accessed directly by browsers on a laptop. Maybe you’re serving RSS feeds or have an API accessed by embedded devices. Maybe your clients actually access the site through a proxy. In all of these cases, there’s a higher chance that some of the users trying to access your server are unable to because of this problem.
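The two client behaviors, and the effect of the server-side fix, can be modeled roughly as follows. This is a deliberately simplified sketch: the `Cert` class, the `validate` function, and the `prefer_trusted` switch are invented for illustration and do not correspond to any real library’s API. A real client also verifies signatures, which this model ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    expired: bool = False


def validate(leaf, sent, store, prefer_trusted):
    """Toy chain check. prefer_trusted=True stops at the first issuer found
    in the local store; False follows the server-sent chain as far as it goes."""
    sent_by_subject = {c.subject: c for c in sent}
    seen = set()
    current = leaf
    while True:
        seen.add(current.subject)
        if current.expired:
            return False
        root = store.get(current.issuer)
        nxt = sent_by_subject.get(current.issuer)
        if prefer_trusted and root is not None:
            return not root.expired
        if nxt is not None and nxt.subject not in seen:
            current = nxt  # keep climbing the chain the server sent
            continue
        return root is not None and not root.expired


leaf = Cert("www.example.com", "R3")  # hypothetical site certificate
r3 = Cert("R3", "ISRG Root X1")
isrg_cross = Cert("ISRG Root X1", "DST Root CA X3")  # cross-signed version
isrg_self = Cert("ISRG Root X1", "ISRG Root X1")     # self-signed version
dst_root = Cert("DST Root CA X3", "DST Root CA X3", expired=True)

store = {"ISRG Root X1": isrg_self, "DST Root CA X3": dst_root}

old_chain = [r3, isrg_cross]   # what the server sent before the fix
fixed_chain = [r3, isrg_self]  # what it sends after the fix

print("modern client, old chain:  ", validate(leaf, old_chain, store, True))
print("legacy client, old chain:  ", validate(leaf, old_chain, store, False))
print("legacy client, fixed chain:", validate(leaf, fixed_chain, store, False))
```

In this model the client that follows the server-sent chain fails only because the server handed it a path that ends at the expired DST root. Once the server stops sending that path, the same client succeeds without any change on its side.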
Fixing This On Your Server
Here’s how I fixed the issue on a test server – please note that some clients may still ignore the certificate chain sent by the server, so no guarantees that it’ll resolve the issue for all – but it should resolve it for modern/compliant clients. This is just an example – please follow the proper procedures for requesting and updating certificates for your environment!
The first thing I did was to renew my certificate. A lot of people thought this was sufficient, but it turns out that the certificate chain configured by the Let’s Encrypt tool still included the ISRG certificate signed by DST!
You can check this using openssl s_client (openssl s_client -connect www.site.com:443 -showcerts)
This means that when a client connects to this server, the server sends this chain and some clients will attempt to validate ISRG using DST – after all, it’s issued by DST!
On my system (Ubuntu 18.04 LTS with Apache 2 and Let’s Encrypt’s certbot installed), the certificates are in /etc/letsencrypt/live/www.mysite.com/fullchain.pem. Certificate chains are listed in order, so the last one in this file is the incorrect ISRG certificate. I got the latest version from here: https://crt.sh/?id=9314791 (there’s a download link on the left side), and replaced the one in fullchain.pem. After restarting the web server, I now have the correct chain being offered.
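For anyone who would rather script that replacement than edit the file by hand, here is a minimal Python sketch of the same idea. The function names are made up for illustration, and the file name `isrg-root-x1.pem` in the usage comment is a placeholder for wherever you saved the downloaded certificate. It treats certificates as opaque PEM text blocks and does not inspect their contents, so it is on you to confirm that the last block really is the one you want to replace.

```python
import re

# Matches one PEM-encoded certificate, including its BEGIN/END markers.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(text):
    """Return the certificates in a PEM bundle, in file order."""
    return PEM_BLOCK.findall(text)


def replace_last_cert(bundle_text, new_cert_pem):
    """Return bundle_text with its last certificate swapped for new_cert_pem."""
    certs = split_pem_bundle(bundle_text)
    replacement = split_pem_bundle(new_cert_pem)
    if not certs:
        raise ValueError("bundle contains no certificates")
    if len(replacement) != 1:
        raise ValueError("expected exactly one replacement certificate")
    certs[-1] = replacement[0]
    return "\n".join(certs) + "\n"


# Usage sketch (paths are examples; back up the original file first):
#
#   from pathlib import Path
#   chain = Path("/etc/letsencrypt/live/www.mysite.com/fullchain.pem")
#   new_root = Path("isrg-root-x1.pem").read_text()
#   chain.write_text(replace_last_cert(chain.read_text(), new_root))
```

After writing the file, restart the web server and re-run the openssl s_client check to confirm what is actually being served.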
Now, the server sends the self-signed ISRG root certificate, which an updated client will be able to validate without needing the expired DST root certificate.
Here at Catchpoint, we have a huge footprint of test agents in every corner of the world. These agents have different configurations of hardware, software, firmware, etc. Because of this large fleet, we saw and solved almost every flavor of the issues described above.
However, the customers accessing your website probably don’t have a 24/7 Operations team. In theory, we can update all of our agents which act as clients to connect to Let’s Encrypt-based sites properly or otherwise ignore this server misconfiguration issue. However, the reality is that you cannot expect every user in the world to do this, and often they cannot. As the owner of that service, you have the solution within your control on the server side, so it is up to you to fix it.
We hope that this explanation is helpful for you to solve whichever issue you might have run into, and have your systems trust Let’s Encrypt once again.
If this kind of incident analysis is exciting to you, reach out - we are hiring in Engineering, Operations and Technical Support.