Why the 200ms delay?
Over the course of our work, we have had the good fortune to dig into numerous network issues. One particularly interesting one involves a curious ~200ms delay in the acknowledgement of the response packet for small HTTP responses. This doesn’t sound all that terrible unless you serve a number of small images or 204 responses, or your cache headers cause the server to answer with a 304. In those cases the behavior can add latency, degrade performance, and raise concern when looking at monitoring data.
Whence the problem?
At its root, RFC 896 [Nagle] attempted to solve the problem of numerous small packets needing acknowledgement over long thin pipes (high-latency, low-bandwidth). The idea was that the sender holds back new small segments on a connection until the data already in flight has been acknowledged.
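The send-side rule can be sketched in a few lines. This is a minimal illustration of the idea, not any operating system's implementation; the `SenderState` class and the `may_send_now` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class SenderState:
    """Hypothetical, simplified view of one connection's send side."""
    mss: int                    # maximum segment size, in bytes
    unacked_bytes: int          # bytes sent but not yet acknowledged
    nagle_enabled: bool = True  # False models an application that opted out


def may_send_now(state: SenderState, pending_bytes: int) -> bool:
    """Return True if the queued data may go out immediately."""
    if pending_bytes <= 0:
        return False  # nothing to send
    if not state.nagle_enabled:
        return True   # Nagle disabled: send small segments right away
    if pending_bytes >= state.mss:
        return True   # a full-sized segment is never held back
    # Small segment: send only when nothing is outstanding.
    return state.unacked_bytes == 0
```

With an assumed MSS of 1460 bytes, a 318-byte write goes out at once on an idle connection, but a second small write waits until the first has been acknowledged.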
Couple that with the language in RFC 1122 Section 4.2.3.2, which aims to increase efficiency by delaying acknowledgements: the receiver acknowledges every other full-sized segment, or otherwise sends an acknowledgement within 500ms. The same behavior is covered later in RFC 2581 Section 4.2, Generating Acknowledgments.
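The receiver-side decision can be sketched the same way. This is a simplified model under stated assumptions: the function name `should_ack_now` and its parameters are hypothetical, and the 200ms default timer is only an illustrative value that stays under the 500ms ceiling.

```python
def should_ack_now(full_segments_pending: int,
                   waited_ms: float,
                   ack_delay_ms: float = 200.0) -> bool:
    """Decide whether a delayed-ACK receiver sends its ACK yet.

    full_segments_pending: full-sized segments received but not yet ACKed
    waited_ms: time since the oldest unacknowledged data arrived
    ack_delay_ms: assumed delayed-ACK timer (illustrative default)
    """
    if full_segments_pending >= 2:
        return True  # every other full-sized segment triggers an ACK
    # Otherwise the ACK waits for the delay timer to expire.
    return waited_ms >= ack_delay_ms
```

A small response never counts as a full-sized segment, so in this model its ACK is always left to the timer.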
In the current environment, where sites are flush with one-by-one-pixel images and 204 responses from third parties, objects that don’t fill a TCP segment run the risk of exacerbating the delay. This appears to be most apparent with Windows clients, which by default have a 200ms ACK delay. Our testing on Linux and Mac clients does not show any signs of this behavior.
Here we have traced an icon being transmitted from a site to a client. The fourth column displays the delta time. The icon is small, only 318 bytes. Accordingly, it appears that the request from No. 2668 is complete in No. 2690; however, it is retransmitted by the server in No. 2833, making two segments outstanding before an ACK is sent back.
Here we can see a similar event but this time affected by an HTTP 302. Again, we see the all-too-familiar delay of the ACK and a retransmit. A careful network admin may be unduly concerned that this is a problem with the network rather than an issue with the client OS and the server’s TCP stack.
The retransmission is due to a retransmission timeout [RTO]. A Linux server expects an ACK within the RTO, which it calculates from the three-way handshake. When the ACK does not arrive in time, the server assumes the packet was lost and retransmits it. For more about the RTO, please refer to our blog on understanding retransmission.
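To see how a delayed ACK can outlast the timer, here is a rough sketch of an initial RTO derived from a single round-trip sample, such as the handshake. It follows the first-measurement formulas in RFC 6298, not the exact arithmetic of any particular kernel. The function names are hypothetical, and the 200ms floor is an assumption for illustration (RFC 6298 itself recommends rounding up to 1 second).

```python
def initial_rto_ms(handshake_rtt_ms: float,
                   clock_granularity_ms: float = 1.0,
                   min_rto_ms: float = 200.0) -> float:
    """Initial RTO from the first RTT sample, per RFC 6298 section 2.2."""
    srtt = handshake_rtt_ms          # SRTT <- R
    rttvar = handshake_rtt_ms / 2.0  # RTTVAR <- R/2
    rto = srtt + max(clock_granularity_ms, 4.0 * rttvar)
    return max(rto, min_rto_ms)      # assumed floor, see lead-in


def ack_arrives_too_late(rto_ms: float, rtt_ms: float,
                         ack_delay_ms: float) -> bool:
    """True if a delayed ACK would reach the sender after the RTO fires."""
    return rtt_ms + ack_delay_ms > rto_ms
```

Under these assumptions, a 50ms handshake RTT yields a 200ms RTO, so an ACK held for 200ms by the client arrives roughly 250ms after the segment was sent and the server has already retransmitted.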
So now what?
Clearly there’s no silver bullet, but the behavior does point to some interesting design challenges of which one should be mindful. Guidance is scarce. RFC 1122 Section 4.2.3.4 (page 99) indicates that an application can turn the Nagle algorithm off to speed the delivery of small streams, but doing so is not entirely without its risks.
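For reference, an application typically disables the Nagle algorithm per connection with the standard TCP_NODELAY socket option. The sketch below uses Python's socket module; the helper name `make_nodelay_socket` is hypothetical, and whether disabling Nagle is appropriate depends on the traffic pattern.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 tells the stack to send small segments immediately
    # instead of waiting for outstanding data to be acknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The option affects only the sending side of the socket it is set on; it does not change how long the peer delays its ACKs.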
There are some Microsoft knowledge base articles regarding changing the ACK frequency here and here. While advanced users (like gamers) have modified their registries, the average user has no idea of this issue.