The Cargo Cult of TCP_NODELAY: When to Use It

I learned a ton writing this post, especially about how HTTP2 works and how its binary format affects network performance. Hopefully you’ll learn something here as well!

What is Nagle’s Algorithm?

My last post kinda exploded on Hacker News while a raging debate on TCP_NODELAY went on. It was wildly interesting to see the two sides but it was clear there was a cargo cult mentality regarding this flag. Some people swear by it but clearly didn’t understand it.

Nagle’s Algorithm was developed in 1984ish and it basically looks like this:

  1. if there is data in the buffer that would fill a packet, send it.
  2. else if it would not fill a packet, wait for an acknowledgment.
  3. send the data.

It’s actually a bit more complicated than that these days, but that is the essential (observable) gist.
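To make the three steps above concrete, here is a minimal toy model in Python. It is only a sketch, not how any real TCP stack is written: the `NagleSender` class and its method names are made up for illustration, the 1460-byte segment size is an assumption (1500 MTU minus 40 bytes of headers), and it pretends a single ACK covers everything in flight while ignoring delayed ACKs and window limits.

```python
from dataclasses import dataclass, field

MSS = 1460  # assumed maximum segment size in bytes


@dataclass
class NagleSender:
    """Toy sender-side model of Nagle's Algorithm (hypothetical, not a real stack)."""
    mss: int = MSS
    buffer: bytearray = field(default_factory=bytearray)
    unacked: bool = False                     # is previously sent data still unacknowledged?
    sent: list = field(default_factory=list)  # sizes of the segments put on the wire

    def write(self, data: bytes) -> None:
        """The application hands data to the socket."""
        self.buffer += data
        self._try_send()

    def ack(self) -> None:
        """The peer acknowledged everything in flight."""
        self.unacked = False
        self._try_send()

    def _try_send(self) -> None:
        # Step 1: data that fills a packet is sent right away.
        while len(self.buffer) >= self.mss:
            self._emit(self.mss)
        # Steps 2 and 3: a partial packet goes out only when nothing is in flight.
        if self.buffer and not self.unacked:
            self._emit(len(self.buffer))

    def _emit(self, n: int) -> None:
        self.sent.append(n)
        del self.buffer[:n]
        self.unacked = True
```

In this model, three quick 10-byte writes produce one 10-byte segment immediately and one 20-byte segment after the ACK arrives, while a 3000-byte write produces two full segments straight away and holds only the 80-byte tail.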

Nagle’s Algorithm will have no effect on well-written applications that can fill packets. For example, here’s nginx using the tcp_nodelay directive:

Go ahead and try it on various networks. (Synthetic throttling in Firefox/Chrome doesn’t do much, btw. Also, Chrome apparently has a few bugs around HTTP2 that this test WILL hit; the test tries to detect this and tell you about it.) If you have any congestion in your network at all, you’ll likely flip-flop between one or the other. If you have no congestion, you’ll still likely flip-flop between them. You can turn on “sporadic” data, which represents a poorly written application sending a few bytes of data, pausing for several dozen ms, then sending a few more bytes, and so on. This is when Nagle’s Algorithm really has a chance to shine, even on a good network.

What’s really interesting is that you can “abuse” an HTTP2 connection to nginx to determine whether there is congestion along the route during the connection phase if keepalives are disabled. This is because nginx sends a GOAWAY frame and a HEADERS frame. If TCP_NODELAY is enabled, these go out as two separate packets instead of one.

Thus we can race two HTTP2 connections: one to a server with TCP_NODELAY and one to a server with Nagle’s Algorithm. The JavaScript console will actually output whether there is congestion and whether or not there are many hops between you and my servers.

With a static file, nginx’s behavior makes the winner between the two settings essentially random. Nginx sends 8k of data at a time (by default), which fits comfortably in a jumbo-frame MTU. On non-jumbo-frame networks these writes are split across several packets and reassembled, but this is a non-issue. I was really quite impressed with everything.

Myths and Legends: the delay

Most people tend to think that TCP_NODELAY means that not using it results in a delay.

This is not the case if you’re sending more than a packet’s worth of data. In that case the setting has virtually no effect, because full packets are sent immediately.

If you’re sending less than a packet’s worth of data (~1.4 kilobyte-ish) in quick succession (less than the RTT of the connection), it can have an enormous effect on a congested network (try 10-byte size with sporadic data on a congested network) or no practical effect when there is no congestion. Note that when I say “practical” I mean in ‘human terms,’ not ‘machine terms.’ So, a human won’t notice a difference on an uncongested network, but they will on a congested network if enough data is sent, whereas a machine with <50ms tolerance, will notice a difference over a long link, though there is less than a millisecond of difference in a data center (at least during my testing).

However, if you’re sending less than a packet’s worth of data, less often (more than the RTT of the connection), TCP_NODELAY will always be the best case.

Thus a ‘client’ or something sending a single request will likely benefit from TCP_NODELAY (browsers have it turned on by default). Further, when mixed with Delayed Acknowledgements, if both ends of the connection are using Nagle’s, then things will go badly.
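The cases above can be condensed into a tiny decision helper. This is just a sketch of the rules of thumb in this post, not a tuning tool: the function name, its return values, and the 1460-byte default are all made up for illustration, and it knows nothing about actual congestion or delayed ACKs.

```python
def suggested_setting(write_size: int, gap_s: float, rtt_s: float, mss: int = 1460) -> str:
    """Map the three cases in the text to a rough recommendation.

    write_size: bytes per application write
    gap_s:      typical time between writes, in seconds
    rtt_s:      round-trip time of the connection, in seconds
    mss:        assumed payload that fits in one packet
    """
    if write_size >= mss:
        # Full packets are sent immediately with or without Nagle's Algorithm.
        return "either"
    if gap_s < rtt_s:
        # Small writes in quick succession: the outcome depends on congestion,
        # so test on the networks you care about.
        return "measure"
    # Small, infrequent writes: nothing gets coalesced, so TCP_NODELAY is the best case.
    return "nodelay"
```

For example, 10-byte writes every millisecond over a 50 ms link fall into the “measure” bucket, while the same writes every half second fall into “nodelay”.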

Also, you can turn off/on TCP_NODELAY mid-connection. Thus, you can tune your socket to be an ideal socket based on the context of the response. If you know for a fact that you’ll be sending bursty data, perhaps turn on Nagle’s so that you can beat congestion. If you know you’ll be streaming data in chunks much larger than a packet (at least 8-9kb), then it won’t matter which you choose.
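As a rough illustration of flipping the flag, here is how it looks with Python’s standard `socket` module. The helper names `set_nodelay` and `nodelay_enabled` are my own; the `setsockopt`/`getsockopt` calls and the `IPPROTO_TCP`/`TCP_NODELAY` constants are the standard ones. The same calls work on a connected socket, which is the mid-connection case.

```python
import socket


def set_nodelay(sock: socket.socket, enabled: bool) -> None:
    """Enable TCP_NODELAY (Nagle off) or disable it (Nagle on) for this socket."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if enabled else 0)


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        set_nodelay(s, True)    # e.g. before a single small request
        print(nodelay_enabled(s))
        set_nodelay(s, False)   # e.g. before a burst of tiny writes
        print(nodelay_enabled(s))
    finally:
        s.close()
```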

Some languages make the underlying socket rather difficult to get to once a connection has been established, or when it was established by something outside your control.

Which setting should you use?

It’d be great if someone did an actual study on this, looking at various protocols and how this setting impacts them, on some fancy test networks more reliable than mine (looking to write a paper? this could be it!). One reason I’ve stayed far away from protobufs and gRPC has been how absolutely crappy they are on unreliable networks, and this setting is probably why.

In the end, choose a setting that makes sense for your application. If you control the whole connection end-to-end (especially important with a language like Go where it can be hard to get to the underlying socket after a connection has been made and wrapped by 15 libraries), it probably means you can have a “smart socket” that can adjust its behavior based on detected network conditions and the data it is about to transmit.

For example, nginx could probably improve its overall latency by sending both of those in a single packet, or by keeping Nagle’s Algorithm enabled until after those initial packets are sent. Either one results in essentially the same outcome, but one is probably simpler to implement and maintain. I have no idea which one it is, but I do know software is always a lesson in trade-offs.

I believe Chrome does something similar because I saw the Magic, Settings, and Window Update all in the same packet. But I think this is the latter optimization and isn’t using Nagle’s, because this packet is far smaller than the MTU (152 bytes, about 1/3 of which is TCP overhead) and Chrome then immediately sends the GET request in another packet (461 bytes).

These could be combined into a single packet fairly easily, again either as an intentional optimization or through careful use of Nagle’s Algorithm to let the device handle it. If you’re worried about Acknowledgments, you shouldn’t be, because in 20 nanoseconds nginx is going to Acknowledge the new TCP session ticket, which was sent just before these packets.

This all feels remarkably like a micro-optimization. It really is until it isn’t. Get on a bad network (those ‘free international cell phone plans’ :shudder:) and you’re screwed if someone hasn’t put any effort at all into this and just went with the Cargo Cult.

Using a buffer

Some people advocate for using TCP_NODELAY with their own buffer and then only flushing the buffer once there is “enough” data in it. I think this is fine (despite my feelings towards contributing to even more buffer-bloat). As long as you can fill packets, it really doesn’t matter what the setting is. It is when you can’t fill packets that this actually makes any kind of difference, and adding a buffer might be worse than allowing Nagle’s Algorithm to do its job. Experiment, and do not, I repeat, do not be afraid to download Wireshark and learn how to read your packets. It is a valuable skill that, apparently, many people don’t take the time to learn. I bet you can learn the basics in less than an afternoon, or at least how to tell how big the payload is.
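For reference, the “own buffer” approach might look something like this sketch. The `CoalescingWriter` class and its 1460-byte threshold are assumptions for illustration; it only needs an object with a `sendall` method, and on a real TCP socket you would set TCP_NODELAY yourself before using it.

```python
class CoalescingWriter:
    """Collect small writes and hand them to the socket once enough has piled up."""

    def __init__(self, sock, threshold: int = 1460):
        self._sock = sock            # anything with a sendall(bytes) method
        self._threshold = threshold  # assumed "enough data": roughly one packet
        self._pending = bytearray()

    def write(self, data: bytes) -> None:
        """Queue data; flush automatically once the threshold is reached."""
        self._pending += data
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self) -> None:
        """Send whatever is queued, if anything."""
        if self._pending:
            self._sock.sendall(bytes(self._pending))
            self._pending.clear()
```

Note the catch: nothing leaves the buffer until the threshold is hit or someone calls `flush()`, so a real version would also need to flush at the end of a message or on a timer. That extra waiting is exactly where this can end up worse than just letting Nagle’s Algorithm do its job.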

Epilogue: What’s up with the magic numbers?

You may have noticed some “magic numbers” in my post. Here’s the reason I chose them so that you can draw your own conclusions and adjust them based on your situation:

sending less than a packet’s worth of data (~1.4 kilobyte-ish)

Most home networks have an MTU of 1500 bytes. This is the lower-bound sized packet that will go anywhere. Packet headers are ~40 bytes (20 for IPv4 plus 20 for TCP, more with options), so the max data you could send in a packet is around 1450-ish bytes, or 1.4 kilobytes.
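If you want to redo this arithmetic for your own network, a small sketch like the following works. The function name is made up; the 20-byte figures are the minimum IPv4 and TCP header sizes without options, and the 12-byte example assumes the TCP timestamp option is in use.

```python
IPV4_HEADER = 20  # bytes, minimum IPv4 header without options
TCP_HEADER = 20   # bytes, minimum TCP header without options


def max_payload(mtu: int, tcp_options: int = 0) -> int:
    """Largest TCP payload that fits in one packet for a given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER - tcp_options


if __name__ == "__main__":
    print(max_payload(1500))                  # typical home network
    print(max_payload(1500, tcp_options=12))  # with TCP timestamps (assumed 12 bytes)
    print(max_payload(9000))                  # jumbo frames
```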

whereas a machine with <50ms tolerance, will notice

I tested this pretty extensively in a datacenter. In my experience, any multi-packet data only experienced a very small delay (less than a ms, which could be attributed to jitter), and a single packet only experienced a delay of 2x the RTT. YMMV, but I’d give a ballpark of 50ms epsilon for any longer links (such as between datacenters over a backbone).

If you know you’ll be streaming data in chunks much larger than a packet (at least 8-9kb)

Some datacenters and servers (and even some home networks these days) support something called ‘jumbo frames’ which allow having an MTU of 9000 bytes. Thus around 8k would be the upper-bound packet size.
