This adventure starts with git-lfs. It was a normal day, and I added a 500 MB binary asset to my server templates. When I went to push it, I noticed that git-lfs was uploading at 50KB per second. Since I had a bit of free time that I’d much rather spend on something other than waiting FOREVER for a file to upload, I decided to head upstairs and plug into the ethernet. I watched it instantly jump up to 2.5 MB per second. Still not very fast, but I was now intensely curious.
Since I figured I would originally have been waiting FOREVER for this upload, I decided to use that time to investigate what was going on. While I would expect wired ethernet to be a bit faster than wifi, I didn’t expect it to be orders (with an s) of magnitude faster. Just to check my sanity, I ran a speed test and saw my upload speed on wifi at 40MB per second, and wired at 60MB per second.
After some investigation with Wireshark and other tools, I learned that my wifi channels have a shitload of interference in the 2.4GHz band, and just a little in the 5GHz band. During this time, I also learned that my router wouldn’t accept a single 5GHz client due to a misconfiguration on my part. So, non-sequitur, apparently enabling “Target Wake Time” was very important (I have no idea what that does). Once that was fixed, I saw 600MB per second on my internal network and outside throughput was about the same as wired.
But why on earth was git-lfs so slow, even on 5GHz? After looking at Wireshark while uploading with git-lfs, I noticed about 30-50% of the traffic was out-of-order/duplicate ACKs, causing retransmissions. I found that weird, though not terribly weird because, remember, this wifi network “sucks” with all my noisy neighbors. It turns out there are random 50-100ms delays all over the place, probably due to interference. When I ran a speed test or browser session, however, it was less than 1%! In fact, git-lfs was barely sending any packets at all, like it was eternally stuck in TCP slow-start.
When I looked at the packets, they were being sent in ~50-byte payload chunks (~100 bytes total, MTU is 1500). I found that very interesting because I would expect Nagle’s algorithm to coalesce small writes so there would be fewer physical packets to send. That is when it hit me: TCP_NODELAY must be set.
Between that, and extremely regular 100ms delays, it could only get off a few packets before getting a “lost packet,” not to mention nearly 50% of every packet was literally packet headers. I was literally, permanently stuck in TCP Slow Start.
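To put a rough number on that header overhead, here is a minimal sketch. It assumes a 20-byte IPv4 header and a 20-byte TCP header with no options; the name headerShare is made up for illustration. Real captures also include TCP options and link-layer framing, which is how a ~50-byte payload can end up near 100 bytes on the wire.

```go
package main

import "fmt"

// Assumed header sizes in bytes. Real packets may also carry TCP options
// (for example timestamps) and link-layer framing on top of these.
const (
	ipv4Header = 20
	tcpHeader  = 20
)

// headerShare returns the fraction of each packet's bytes that is
// IPv4 + TCP header rather than payload.
func headerShare(payload int) float64 {
	headers := ipv4Header + tcpHeader
	return float64(headers) / float64(headers+payload)
}

func main() {
	// Compare the tiny chunks from the capture with a full-sized segment.
	for _, payload := range []int{50, 1460} {
		fmt.Printf("payload %4d bytes: %.1f%% headers\n", payload, 100*headerShare(payload))
	}
}
```

Under these assumptions, a 50-byte payload spends about 44% of each packet on headers, while a full 1460-byte payload spends under 3%.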
TCP No Delay from Memory
Nagle’s Algorithm was written approximately 4 decades ago to solve the “tinygram” problem, where you are sending a whole bunch of little packets, flooding the network, and reducing network throughput. Nagle’s algorithm essentially bundles all the little packets into one big packet, waiting for an ACK or a full packet to be constructed, whichever is sooner.
Theoretically.
It’s a bit more complex than that due to decades of changes to make the web better and more performant… but turning on TCP_NODELAY would mean that each of those 50-byte chunks is sent out as its own packet instead of being combined into a few bigger packets. This increases the network load, and when there’s a probability that a packet will need to be retransmitted, you’ll see a lot more retransmissions.
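The basic rule can be sketched as a toy model. This is not how any real TCP stack is written: it ignores windows, timers, and delayed ACKs, and the type nagleSender and its methods are names invented here. It only captures the rule that full segments go out right away, while a partial segment waits until nothing is awaiting an ACK.

```go
package main

import "fmt"

// nagleSender is a toy model of the sender-side rule, not a real TCP stack.
type nagleSender struct {
	mss      int   // maximum segment size in bytes
	buffered int   // bytes queued but not yet sent
	inFlight bool  // true while sent data has not been ACKed
	sent     []int // sizes of the segments sent so far
}

// write queues n bytes and sends whatever the rule allows.
func (s *nagleSender) write(n int) {
	s.buffered += n
	s.flush()
}

// ack records that all outstanding data was acknowledged.
func (s *nagleSender) ack() {
	s.inFlight = false
	s.flush()
}

func (s *nagleSender) flush() {
	// Full segments always go out immediately.
	for s.buffered >= s.mss {
		s.sent = append(s.sent, s.mss)
		s.buffered -= s.mss
		s.inFlight = true
	}
	// A partial segment goes out only when nothing is awaiting an ACK.
	if s.buffered > 0 && !s.inFlight {
		s.sent = append(s.sent, s.buffered)
		s.buffered = 0
		s.inFlight = true
	}
}

func main() {
	s := &nagleSender{mss: 1460}
	for i := 0; i < 10; i++ {
		s.write(50) // ten 50-byte application writes
	}
	s.ack()
	fmt.Println(s.sent) // prints [50 450]
}
```

In this model, ten 50-byte writes become two segments: the first 50 bytes go out immediately, and the remaining 450 bytes are bundled once the ACK arrives. With TCP_NODELAY set, the same writes would go out as ten separate 50-byte segments.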
If you want to know more, use Google.
Diving in the code
From there, I went into the git-lfs codebase. I didn’t see any calls to setNoDelay, and when I looked it up, I found that no-delay was the default. Sure enough:
Indeedly, the socket disables Nagle’s algorithm by default in Go.
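If you want Nagle’s algorithm back on a connection you control, the standard library exposes SetNoDelay on *net.TCPConn. Here is a minimal sketch. The helper names dialWithNagle and demo are made up for illustration, and the loopback listener exists only so the example runs on its own. This is not what git-lfs does or should do, just one way to flip the option.

```go
package main

import (
	"fmt"
	"net"
)

// dialWithNagle is a hypothetical helper: it dials addr over TCP and
// turns Nagle's algorithm back on by clearing TCP_NODELAY.
func dialWithNagle(addr string) (*net.TCPConn, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		conn.Close()
		return nil, fmt.Errorf("not a TCP connection: %T", conn)
	}
	// Go sets TCP_NODELAY on new TCP connections; false re-enables Nagle.
	if err := tcpConn.SetNoDelay(false); err != nil {
		tcpConn.Close()
		return nil, err
	}
	return tcpConn, nil
}

// demo starts a throwaway loopback listener and connects to it.
func demo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	conn, err := dialWithNagle(ln.Addr().String())
	if err != nil {
		return err
	}
	defer conn.Close()
	fmt.Println("connected with Nagle's algorithm enabled:", conn.RemoteAddr())
	return nil
}

func main() {
	if err := demo(); err != nil {
		panic(err)
	}
}
```

The type assertion matters because net.Dial returns a net.Conn interface, and SetNoDelay only exists on the concrete TCP type.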
Is this a trick?
I think this is a pretty nasty trick. The “default” in most languages I’ve used has TCP_NODELAY turned off. Turning it on has some serious consequences (most of them bad).
- Can easily saturate a network with packet overhead to send a single byte.
- Can send a whole bunch of small packets with high overhead (eg, half the data being sent is packet headers for git-lfs)
- Reduces latency (the only pro) by sending small packets
- Can cause havoc on an unreliable link
I wasn’t able to dig out why Go chose to disable Nagle’s algorithm, though I assume a decision was made at some point and discussed. But this is tricky because it is literally the exact opposite of what you’d expect coming from any other language.
Further, this “trick” has probably wasted hundreds of thousands of hours while transferring data over unreliable links (such as getting stuck in TCP slow start, saturating devices with “tinygram” packets, etc). As a developer, you expect the language to do “the best thing” it is able to do. In other words, I expect the network to be efficient. Literally decades of research, trial, and error have gone into making the network efficient.
I would absolutely love to discover the original code review for this and why this was chosen as a default. If the PRs from 2011 are any indication, it was probably to get unit tests to pass faster. If you know why this is the default, I’d love to hear about it!
That code was in turn a loose port of the dial function from Plan 9 from User Space, where I added TCP_NODELAY to new connections by default in 2004 [1], with the unhelpful commit message “various tweaks”. If I had known this code would eventually be of interest to so many people maybe I would have written a better commit message!
I do remember why, though. At the time, I was working on a variety of RPC-based systems that ran over TCP, and I couldn’t understand why they were so incredibly slow. The answer turned out to be TCP_NODELAY not being set. As John Nagle points out [2], the issue is really a bad interaction between delayed acks and Nagle’s algorithm, but the only option on the FreeBSD system I was using was TCP_NODELAY, so that was the answer. In another system I built around that time I ran an RPC protocol over ssh, and I had to patch ssh to set TCP_NODELAY, because at the time ssh only set it for sessions with ptys [3]. TCP_NODELAY being off is a terrible default for trying to do anything with more than one round trip.
When I wrote the Go implementation of net.Dial, which I expected to be used for RPC-based systems, it seemed like a no-brainer to set TCP_NODELAY by default. I have a vague memory of discussing it with Dave Presotto (our local networking expert, my officemate at the time, and the listed reviewer of that commit) which is why we ended up with SetNoDelay as an override from the very beginning. If it had been up to me, I probably would have left SetNoDelay out entirely.
rsc on Hacker News
Effects elsewhere
This ‘default’ has some pretty significant knock-on effects throughout the Go ecosystem. I was seeing terrible performance from Caddy, for example, on my network. It was fairly frustrating that I couldn’t identify the issue. But after some testing, now I know (I opened an issue).
Much (all?) of Kubernetes is written in Go, so how has this default affected that? In this case, this ‘default’ is probably desired. Probably. The network is (usually) reliable, with 10G+ links between nodes, so it can handle sending small packets with 40-byte headers. Probably.
This obviously affects git-lfs, much to my annoyance. I hope they fix it… I opened an issue.
When to use this?
In most cases, TCP_NODELAY shouldn’t be enabled, especially if you don’t know how reliable the network is and you aren’t managing your own buffers. If you’re literally streaming data a chunk at a time, at least fill a packet before sending it! Otherwise, turn off TCP_NODELAY, stream your little chunks to the socket, and let Nagle’s Algorithm handle it for you.
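If you do keep TCP_NODELAY on, one way to “fill a packet” is to put a buffered writer in front of the socket. The sketch below uses bufio.Writer from the standard library. The countingWriter type stands in for a socket so the example runs without a network, and the 1460-byte buffer is an assumed payload size for a 1500-byte MTU, not a number taken from git-lfs.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter stands in for a socket: it counts Write calls, each of
// which could become its own small packet on a TCP_NODELAY connection.
type countingWriter struct {
	calls int
	bytes int
}

func (w *countingWriter) Write(p []byte) (int, error) {
	w.calls++
	w.bytes += len(p)
	return len(p), nil
}

// sendChunks writes n chunks of size bytes each to w.
func sendChunks(w io.Writer, n, size int) error {
	chunk := make([]byte, size)
	for i := 0; i < n; i++ {
		if _, err := w.Write(chunk); err != nil {
			return err
		}
	}
	return nil
}

// countWrites sends 100 chunks of 50 bytes directly and through a buffer,
// and reports how many writes reached the underlying "socket" each way.
func countWrites() (direct, buffered int, err error) {
	d := &countingWriter{}
	if err = sendChunks(d, 100, 50); err != nil {
		return 0, 0, err
	}

	b := &countingWriter{}
	bw := bufio.NewWriterSize(b, 1460) // assumed payload size for a 1500-byte MTU
	if err = sendChunks(bw, 100, 50); err != nil {
		return 0, 0, err
	}
	// Flush pushes out the final partial buffer.
	if err = bw.Flush(); err != nil {
		return 0, 0, err
	}
	return d.calls, b.calls, nil
}

func main() {
	direct, buffered, err := countWrites()
	if err != nil {
		panic(err)
	}
	fmt.Println("direct writes:  ", direct)
	fmt.Println("buffered writes:", buffered)
}
```

The direct path hands all 100 little chunks to the socket one at a time, while the buffered path hands over only a handful of near-full buffers. The trade-off is that you now have to call Flush yourself whenever a message must actually go out.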
Most people turn to TCP_NODELAY because of the “200ms” latency you might incur on a connection. Fun fact, this doesn’t come from Nagle’s algorithm, but from Delayed ACKs or Corking. Yet people turn off Nagle’s algorithm … :sigh:
Here’s the thing though, would you rather your user wait 200ms, or 40s to download a few megabytes on an otherwise gigabit connection?
Follow along
This isn’t the end of the journey. Follow this blog to get updates.