One of the biggest challenges for online companies is troubleshooting unexpected slowness of their web site or application in a highly distributed architecture. Pinpointing the exact root cause of the slowness is like looking for a needle in a haystack.
Looking at a typical large-scale web site infrastructure, there are various layers where a problem can occur: the ISP, network devices (routers and switches), load balancers, hardware, web servers, application servers, databases, backend applications and services, and lastly the tens of firewall layers protecting the system.
Recently, we had the opportunity to help a company troubleshoot such a performance issue. They started monitoring their homepage with Catchpoint using IE8 and Chrome agents and observed abnormal slowness and variability.
From the chart we can clearly see that the time to load the base HTML is quite high: it increases over time and then quickly drops. Clearly the problem was not with the content of the page or with any third-party providers. The browser was spending about 40% of the time trying to download the HTML from the server.
With a couple of clicks we narrowed the problem down to the Wait time: the time it takes the client (browser) to get the first byte after the TCP connection is established. The metric shows how long it takes for the web server, the app server, and the database/backend to process the request. However, when you measure this metric over a wide area network (say the Internet), it is also affected by network connectivity, so it is not always clear whether the network or the application stack is at fault.
Wait Time for Homepage
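To make the Wait time definition concrete, here is a minimal sketch of how one could measure it by hand. It is not how Catchpoint collects the metric; the function name `measure_wait_time` is hypothetical, it speaks plain HTTP only (no TLS), and it counts the time to send the request as part of the wait.

```python
import socket
import time


def measure_wait_time(host, path="/", port=80, timeout=10.0):
    """Return (connect_seconds, wait_seconds) for one plain-HTTP GET.

    wait_seconds runs from the moment the TCP connection is established
    until the first byte of the response arrives.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.perf_counter()  # TCP handshake finished

        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n"
            "\r\n"
        )
        sock.sendall(request.encode("ascii"))

        first = sock.recv(1)  # blocks until the first response byte
        first_byte = time.perf_counter()

    if not first:
        raise ConnectionError("server closed the connection without responding")

    return connected - start, first_byte - connected
```

Calling `measure_wait_time("www.example.com", "/")` at regular intervals (the host name is a placeholder) would give a rough wait time series for a page. A single sample says little; the trend over time is what matters.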
One quick way to take application performance out of the picture is to measure the response of an HTTP request that is served by the same server, over the same network, but is not handled by the web application. This is where my friend “Robots.txt” comes into play: most websites have this tiny file on their web servers, and it is not handled by the application layer.
We monitored the “robots.txt” performance and compared its wait time to that of the homepage test. If the issue were the network, the load balancer, or the server itself, we would see a correlation in the data: both wait times would rise and fall together. If it were the application layer, we would see no correlation.
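The comparison can also be done numerically rather than by eye. The sketch below computes a Pearson correlation between two wait time series sampled at the same moments. The helper names, the 0.7 threshold, and the sample numbers are all assumptions made for illustration; they are not the client's data or a Catchpoint feature.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series cannot move together with anything
    return cov / math.sqrt(var_x * var_y)


def likely_culprit(homepage_waits, robots_waits, threshold=0.7):
    """Guess which side is slow; the threshold is an arbitrary assumption."""
    if pearson(homepage_waits, robots_waits) >= threshold:
        return "network, load balancer, or server"
    return "application layer"


# Made-up wait times in milliseconds, sampled at the same moments.
homepage = [900, 1400, 2100, 2800, 3300, 600]
robots = [45, 45, 45, 45, 45, 45]  # stays flat while the homepage degrades
print(likely_culprit(homepage, robots))
```

A coefficient near 1 means the static file slows down together with the homepage, which points away from the application code. A coefficient near 0 points toward it.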
Wait time for Homepage VS. Robots.txt
As you can clearly see, the problem was not caused by the network or the load balancer, but by the application itself. After further investigation of the application code, the client was able to determine the cause and solve the problem.
When you are looking from the outside in and lack insight into internal application performance, monitoring Robots.txt can be a quick way to figure out whether your application code is the one at fault.
Mehdi – Catchpoint