Tag: golang

  • Optimizing CGO handles


    I work on FrankenPHP quite a bit, which is a PHP SAPI written in Go. For the uninitiated, PHP is written in C, so FrankenPHP uses cgo to interface with PHP. This also means that it uses Handles.

    Just to give you a bit of background on Handles and what they are used for: in cgo, you can’t pass pointers back and forth between C and Go, because the Go runtime is free to move memory around and the garbage collector can’t track pointers once C holds them. So, Go provides “Handles”: a mapping between an integer and a Go value. If you want to pass something to C, you pass the handle’s integer instead; when it gets passed back to Go, you look up the handle by that integer and get the Go value back.
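    As a rough sketch of the pattern (passToC, calledFromC, and MyType here are hypothetical stand-ins for real cgo bindings):

    package main
    
    import "runtime/cgo"
    
    type MyType struct{ name string } // stand-in for whatever value C needs to hold on to
    
    // passToC stands in for a real C function that receives the handle as a plain integer.
    func passToC(p uintptr) { calledFromC(p) }
    
    // calledFromC is what C would invoke later, handing the integer back to Go.
    func calledFromC(p uintptr) {
    	h := cgo.Handle(p)
    	v := h.Value().(MyType) // look the Go value back up via the handle
    	_ = v
    	h.Delete() // release the handle once C is done with it
    }
    
    func main() {
    	h := cgo.NewHandle(MyType{name: "hello"}) // allocate a handle for the Go value
    	passToC(uintptr(h))                       // C only ever sees an integer, never a Go pointer
    }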

    In essence, this works “good enough” for most applications. The only issue is that it uses a sync.Map under the hood, and every new handle inserts a brand-new key, which takes sync.Map’s locked path. This means that if we receive 1,000 requests at exactly the same time, the handle creations are largely processed one at a time. Granted, the current implementation of sync.Map is quite fast… so, can we improve it?
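    For reference, the current implementation in runtime/cgo boils down to something like this (paraphrased from memory, not a verbatim quote of the Go source; Handle is defined in the package as a uintptr):

    var (
    	handles   sync.Map       // map[uintptr]any: handle number → Go value
    	handleIdx atomic.Uintptr // the last issued handle number
    )
    
    func NewHandle(v any) Handle {
    	h := handleIdx.Add(1) // a fresh number on every call
    	if h == 0 {
    		panic("runtime/cgo: ran out of handle space")
    	}
    	handles.Store(h, v) // always a brand-new key
    	return Handle(h)
    }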

    First, let’s get an idea of how fast it currently is by performing a benchmark. There is only one problem with Go’s current benchmarks: they don’t really show concurrent performance.

    Let me illustrate. Here’s the “default” output of the current Handle benchmarks:

    BenchmarkHandle
    BenchmarkHandle/non-concurrent
    BenchmarkHandle/non-concurrent-16                2665556               448.9 ns/op
    BenchmarkHandle/concurrent
    BenchmarkHandle/concurrent-16                    2076867               562.3 ns/op

    This doesn’t show us anything about the scale of the problem nor what we are solving for. So, I’m going to update the benchmark code a bit to show what is happening:

    func formatNum(n int) string {
    	// add commas to separate thousands
    	s := fmt.Sprint(n)
    	if n < 0 {
    		return "-" + formatNum(-n)
    	}
    	if len(s) <= 3 {
    		return s
    	}
    	return formatNum(n/1000) + "," + s[len(s)-3:]
    }
    
    func BenchmarkHandle(b *testing.B) {
    	b.Run("non-concurrent", func(b *testing.B) {
    		for i := 0; i < b.N; i++ {
    			h := NewHandle(i)
    			_ = h.Value()
    			h.Delete()
    		}
    	})
    	b.Run("concurrent", func(b *testing.B) {
    		start := time.Now()
    		b.RunParallel(func(pb *testing.PB) {
    			var v int
    			for pb.Next() {
    				h := NewHandle(v)
    				_ = h.Value()
    				h.Delete()
    			}
    		})
    		fmt.Printf("Time: %fs\tTime Each: %dns\tNumber of handles: %s\n", time.Since(start).Seconds(), time.Since(start).Nanoseconds()/int64(b.N), formatNum(b.N))
    	})
    }
    

    Now, let us see the output!

    goos: linux
    goarch: amd64
    pkg: runtime/cgo
    cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
    BenchmarkHandle
    BenchmarkHandle/non-concurrent
    BenchmarkHandle/non-concurrent-16                2618832               459.4 ns/op
    BenchmarkHandle/concurrent
    Time: 0.000045s Time Each: 44873ns      Number of handles: 1
    Time: 0.000063s Time Each: 634ns        Number of handles: 100
    Time: 0.006349s Time Each: 634ns        Number of handles: 10,000
    Time: 0.561490s Time Each: 561ns        Number of handles: 1,000,000
    Time: 1.210851s Time Each: 566ns        Number of handles: 2,137,053
    BenchmarkHandle/concurrent-16                    2137053               566.6 ns/op

    Here we can see what is going on under the hood: the benchmark tool actually runs the benchmark several times, increasing the iteration count each time, and estimates how many iterations it can complete in about one second.

    Now we can kind of see our problem: we hit a ceiling on how fast we can create handles. Getting 100 thousand requests at once (not uncommon in a load test) means a minimum latency of roughly 60 milliseconds for the last request, at ~560ns per handle.

    How could we implement this in a lock-free way, without using mutexes (or at least, keeping them to a minimum)?

    I went for a walk and thought about it. The best I could come up with was some clever usages of slices and partitions. Then I implemented it. (you can see that implementation here, but I’m not going to talk about it)

    Once I submitted that patch to Go, it was immediately “held” since it introduced a new API. Thus, I went back to the drawing board to figure out how to implement it without changing the API.

    Essentially, it boils down to thinking about how we can cleverly use slices and trade some memory for speed. I started with a basic slice of atomic pointers ([]*atomic.Pointer[slot]). For those who don’t know Go’s typed atomics: an atomic.Pointer[T] starts out holding nil, and every value stored in it afterwards must be a pointer to the same type T. So, I needed to create a new type to hold the values: “slots.”

    // slot represents a slot in the handles slice for concurrent access.
    type slot struct {
    	value any
    }

    But we still need some way to indicate that a slot is empty, and it needs to be fast. Thus we will create a “nil slot” sentinel that can be checked cheaply: we can just compare the address.

    And finally, we need to atomically be able to grow the slice of slots and keep track of the size atomically as well. Altogether, this results in the following data structures:

    // slot represents a slot in the handles slice for concurrent access.
    type slot struct {
    	value any
    }
    
    var (
    	handles     []*atomic.Pointer[slot]
    	releasedIdx sync.Pool
    	nilSlot     = &slot{}
    	growLock    sync.RWMutex
    	handLen     atomic.Uint64
    )
    
    func init() {
    	handles = make([]*atomic.Pointer[slot], 0)
    	// handLen and growLock are ready to use as their zero values.
    	releasedIdx.New = func() any {
    		return nil
    	}
    }

    Oh, and we also need a “free-list” to keep track of the freed indices for reuse. We’ll use a sync.Pool, which keeps per-thread lists and falls back to a shared global one when necessary. It is very fast!
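    The round-trip through the pool is tiny; a minimal illustration, using the releasedIdx pool declared above:

    // recycleExample illustrates the free-list round-trip.
    func recycleExample() {
    	releasedIdx.Put(uint64(42)) // Delete() returns a freed index to the pool
    
    	if idx := releasedIdx.Get(); idx != nil {
    		h := idx.(uint64) // NewHandle() reuses it, skipping the linear scan
    		_ = h
    	}
    }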

    Creating a Handle

    Now that we have all the parts necessary to store our handles, we can go about creating a new handle. Here’s the basic algorithm:

    1. Set the current handle integer to 1
    2. Create a slot for the data
    3. Forever:
    4. Check the sync.Pool for a released handle
      • if there is one, set the current handle to it
    5. If the current handle is larger than the slice length, grow the slice
    6. Try to compare-and-swap our slot over the empty sentinel at the current index
      • if it fails, increase the current handle by one
      • if it succeeds, we have our handle
    7. goto 3

    The smart people out there may notice this is O(n) in the worst case, and thus it should be “slower” than sync.Map since, according to ChatGPT, a map store should be O(1).

    However, we are not trying to be “faster,” but lock-free.

    The actual implementation looks like this:

    func NewHandle(v any) Handle {
    	var h uint64 = 1
    	s := &slot{value: v}
    	for {
    		// Prefer a previously released index, if one is available.
    		if released := releasedIdx.Get(); released != nil {
    			h = released.(uint64)
    		}
    
    		// Grow the slice if the candidate index is out of range.
    		if h > handLen.Load() {
    			growLock.Lock()
    			handles = append(handles, &atomic.Pointer[slot]{})
    			handLen.Store(uint64(len(handles)))
    			growLock.Unlock()
    		}
    
    		// Claim the slot if it is empty (released, or never used).
    		growLock.RLock()
    		if handles[h-1].CompareAndSwap(nilSlot, s) {
    			growLock.RUnlock()
    			return Handle(h)
    		} else if handles[h-1].CompareAndSwap(nil, s) {
    			growLock.RUnlock()
    			return Handle(h)
    		} else {
    			h++ // slot occupied: try the next index
    		}
    		growLock.RUnlock()
    	}
    }

    Retrieving a value is O(1), however:

    func (h Handle) Value() any {
    	growLock.RLock()
    	defer growLock.RUnlock()
    	if h == 0 || h > Handle(len(handles)) {
    		panic("runtime/cgo: misuse of an invalid Handle")
    	}
    
    	v := handles[h-1].Load()
    	if v == nil || v == nilSlot {
    		panic("runtime/cgo: misuse of a released Handle")
    	}
    
    	return v.value
    }

    As is deleting:

    func (h Handle) Delete() {
    	growLock.RLock()
    	defer growLock.RUnlock()
    	if h == 0 {
    		panic("runtime/cgo: misuse of a zero Handle")
    	}
    
    	if h > Handle(len(handles)) {
    		panic("runtime/cgo: misuse of an invalid Handle")
    	}
    
    	if v := handles[h-1].Swap(nilSlot); v == nil || v == nilSlot {
    		panic("runtime/cgo: misuse of a released Handle")
    	}
    	releasedIdx.Put(uint64(h))
    }

    When we put it all together, we can run our benchmarks:

    goos: linux
    goarch: amd64
    pkg: runtime/cgo
    cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
    BenchmarkHandle
    BenchmarkHandle/non-concurrent
    BenchmarkHandle/non-concurrent-16               11961500                99.17 ns/op
    BenchmarkHandle/concurrent
    Time: 0.000016s Time Each: 15950ns      Number of handles: 1
    Time: 0.000027s Time Each: 272ns        Number of handles: 100
    Time: 0.001407s Time Each: 140ns        Number of handles: 10,000
    Time: 0.103819s Time Each: 103ns        Number of handles: 1,000,000
    Time: 1.129587s Time Each: 97ns Number of handles: 11,552,226
    BenchmarkHandle/concurrent-16                   11552226                97.78 ns/op

    Now, even if we got one million requests at the same time, our latency would only be about 100 milliseconds, down from over half a second. Previously, just from handles, our max requests-per-second was ~1.7 million. Now, it is ~10.2 million.

    This is a huge speed-up for such a micro-optimization… but what is the actual performance gain in FrankenPHP, and what does the “worst case” of O(n) actually look like?

    goos: linux
    goarch: amd64
    pkg: runtime/cgo
    cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
    BenchmarkHandle
    BenchmarkHandle/non-concurrent
    BenchmarkHandle/non-concurrent-16               11620329                99.21 ns/op
    BenchmarkHandle/concurrent
    Time: 0.000020s Time Each: 20429ns      Number of handles: 1
    Time: 0.000019s Time Each: 193ns        Number of handles: 100
    Time: 0.002410s Time Each: 240ns        Number of handles: 10,000
    Time: 0.248238s Time Each: 248ns        Number of handles: 1,000,000
    Time: 1.216321s Time Each: 251ns        Number of handles: 4,833,445
    BenchmarkHandle/concurrent-16                    4833445               251.7 ns/op

    In the worst case (ignoring our “free-list”) we are still handily faster than the original implementation (about twice as fast).

    For those of you who like graphs, here it is: [graph of handle creation time, original sync.Map implementation vs. the lock-free implementation]

    For FrankenPHP, it results in a 1-5% increase in requests-per-second. So, while this does measurably improve the performance of FrankenPHP, it doesn’t equate to the ~5x improvement we’d expect from the microbenchmarks, which means it is time to go hunting for more improvements in Go.

    I’ve submitted a CL to Go with these improvements (https://go-review.googlesource.com/c/go/+/600875), but I have no idea if they will ever review or consider it. From Go’s perspective (and most developers’), handle creation usually isn’t on a hot path where such a micro-optimization pays off. It’s only when handles are created on a hot path (such as in FrankenPHP’s case) that any performance gains can be realized.

  • Golang is evil on shitty networks

    This adventure starts with git-lfs. It was a normal day, and I added a 500 MB binary asset to my server templates. When I went to push it, I found it interesting that git-lfs was uploading at 50KB per second. Since I had a bit of free time that I’d much rather spend on something other than waiting FOREVER for a file to upload, I decided to head upstairs and plug into the ethernet. I watched the upload speed instantly jump to 2.5MB per second. Still not very fast, but I was now intensely curious.

    Since I figured I would otherwise have been waiting FOREVER for this to upload, I decided to use that time to investigate what was going on. While I would expect wired ethernet to be a bit faster than wifi, I didn’t expect it to be orders (with an s) of magnitude faster. Just to check my sanity, I ran a speed test and saw an upload speed of 40MB per second on wifi and 60MB per second wired.

    After some investigation with Wireshark and other tools, I learned that my wifi channels have a shitload of interference in the 2.4GHz band, and just a little in the 5GHz band. During this time, I also learned that my router wouldn’t accept a single 5GHz client due to a misconfiguration on my part. So, non-sequitur: apparently enabling “Target Wake Time” was very important (I have no idea what that does). Once that was fixed, I saw 600MB per second on my internal network, and outside throughput was about the same as wired.

    But why on earth was git-lfs so slow, even on 5GHz? After watching Wireshark during a git-lfs upload, I noticed about 30-50% of the traffic was out-of-order/duplicate ACKs, causing retransmissions. I found that especially weird. Well, not terribly weird, because remember, this wifi network “sucks” with all my noisy neighbors: it turns out there are random 50-100ms delays all over the place, probably due to interference. When I ran a speed test or a browser session, however, it was less than 1%! In fact, git-lfs was barely sending any packets at all, as if it were eternally stuck in TCP slow start.

    Graph showing packets-per-second (total) and packet errors for git-lfs

    When I looked at the packets, they were being sent in ~50-byte payload chunks (~100 bytes total, against an MTU of 1500). I found that very interesting, because I would expect Nagle’s algorithm to coalesce those writes so there would be fewer physical packets to send. That is when it hit me: TCP_NODELAY must be set.

    Between that and the extremely regular 100ms delays, it could only get off a few packets before hitting a “lost packet,” and nearly 50% of every packet was literally packet headers. I was permanently stuck in TCP slow start.

    TCP No Delay from Memory

    Nagle’s algorithm was written roughly four decades ago to solve the “tinygram” problem: sending a whole bunch of tiny packets floods the network with overhead and reduces throughput. Nagle’s algorithm essentially bundles the little writes into one bigger packet, waiting for an ACK or a full packet to be constructed, whichever comes sooner.
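    From memory, the decision rule looks roughly like this (a Go-flavored sketch of the textbook pseudocode, ignoring window-size details; nagleShouldSend is a made-up name):

    // nagleShouldSend sketches the sender-side rule of Nagle’s algorithm:
    // small segments are buffered while earlier data is still unacknowledged.
    func nagleShouldSend(payloadLen, mss int, unackedInFlight bool) bool {
    	if payloadLen >= mss {
    		return true // a full segment is always sent immediately
    	}
    	if !unackedInFlight {
    		return true // nothing awaiting an ACK: the small segment goes out now
    	}
    	return false // otherwise, buffer until an ACK arrives or a full segment forms
    }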

    Theoretically.

    It’s a bit more complex than that, due to decades of changes to make the web better and more performant… but turning on TCP_NODELAY means that each of those 50-byte writes goes out as its own packet instead of being coalesced into a few bigger ones. This increases the network load, and when there’s a real probability that any given packet will need to be retransmitted, you’ll see a lot more retransmissions.

    If you want to know more, use Google.

    Diving in the code

    From there, I went into the git-lfs codebase. I didn’t see any calls to SetNoDelay, and when I looked that up, I learned that no-delay is the default in Go. Sure enough:
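    From memory, the relevant code in Go’s net package looks something like this (paraphrased, not a verbatim quote of the source):

    // Every new TCP connection gets TCP_NODELAY switched on before
    // it is handed back to the caller.
    func newTCPConn(fd *netFD) *TCPConn {
    	c := &TCPConn{conn{fd}}
    	setNoDelay(c.fd, true) // disables Nagle’s algorithm by default
    	return c
    }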

    Indeedly, the socket disables Nagle’s algorithm by default in Go.

    Is this a trick?

    I think this is a pretty nasty trick. The default in most languages I’ve used has TCP_NODELAY turned off. Turning it on has some serious consequences (most of them bad):

    1. It can easily saturate a network with packet overhead just to send a single byte.
    2. It can send a whole bunch of small packets with high overhead (e.g., for git-lfs, nearly half the bytes on the wire were packet headers).
    3. It reduces latency (the only pro) by sending small writes immediately.
    4. It can cause havoc on an unreliable link.

    I wasn’t able to dig out why Go chose to disable Nagle’s algorithm, though I assume a decision was made and discussed at some point. It is a tricky default, because it is literally the exact opposite of what you’d expect coming from nearly any other language.

    Further, this “trick” has probably wasted hundreds of thousands of hours spent transferring data over unreliable links (getting stuck in TCP slow start, saturating devices with “tinygram” packets, and so on). As a developer, you expect the language to do “the best thing” it can; in other words, I expect the network to be used efficiently. Literally decades of research, trial, and error have gone into making the network efficient.

    I would absolutely love to discover the original code review for this and why this was chosen as a default. If the PRs from 2011 are any indication, it was probably to get unit tests to pass faster. If you know why this is the default, I’d love to hear about it!

    That code was in turn a loose port of the dial function from Plan 9 from User Space, where I added TCP_NODELAY to new connections by default in 2004 [1], with the unhelpful commit message “various tweaks”. If I had known this code would eventually be of interest to so many people maybe I would have written a better commit message!

    I do remember why, though. At the time, I was working on a variety of RPC-based systems that ran over TCP, and I couldn’t understand why they were so incredibly slow. The answer turned out to be TCP_NODELAY not being set. As John Nagle points out [2], the issue is really a bad interaction between delayed acks and Nagle’s algorithm, but the only option on the FreeBSD system I was using was TCP_NODELAY, so that was the answer. In another system I built around that time I ran an RPC protocol over ssh, and I had to patch ssh to set TCP_NODELAY, because at the time ssh only set it for sessions with ptys [3]. TCP_NODELAY being off is a terrible default for trying to do anything with more than one round trip.

    When I wrote the Go implementation of net.Dial, which I expected to be used for RPC-based systems, it seemed like a no-brainer to set TCP_NODELAY by default. I have a vague memory of discussing it with Dave Presotto (our local networking expert, my officemate at the time, and the listed reviewer of that commit) which is why we ended up with SetNoDelay as an override from the very beginning. If it had been up to me, I probably would have left SetNoDelay out entirely.

    rsc on Hacker News

    Effects elsewhere

    This ‘default’ has some pretty significant knock-on effects throughout the Go ecosystem. I was seeing terrible performance from Caddy, for example, on my network, and it was fairly frustrating that I couldn’t identify the issue. After some testing, now I know (I opened an issue).

    Much (all?) of Kubernetes is written in Go; how has this default affected it? In that case, this ‘default’ is probably desired. Probably. The network between nodes is (usually) reliable, with 10G+ links, so it can handle small packets with 40 bytes of header overhead. Probably.

    This obviously affects git-lfs, much to my annoyance. I hope they fix it… I opened an issue.

    When to use this?

    In most cases, TCP_NODELAY shouldn’t be enabled, especially if you don’t know how reliable the network is and you aren’t managing your own buffers. If you’re streaming data a chunk at a time, at least fill a packet before sending it! Otherwise, turn off TCP_NODELAY, stream your little chunks to the socket, and let Nagle’s algorithm handle the batching for you.
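    In Go, that means getting at the concrete *net.TCPConn and opting back in to Nagle’s algorithm yourself. A minimal sketch (example.com is just a placeholder):

    package main
    
    import (
    	"log"
    	"net"
    )
    
    func main() {
    	conn, err := net.Dial("tcp", "example.com:443")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer conn.Close()
    
    	// Go enables TCP_NODELAY by default; SetNoDelay(false)
    	// re-enables Nagle’s algorithm so small writes get coalesced.
    	if tcp, ok := conn.(*net.TCPConn); ok {
    		if err := tcp.SetNoDelay(false); err != nil {
    			log.Fatal(err)
    		}
    	}
    
    	_, _ = conn.Write([]byte("many small writes can now be batched"))
    }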

    Most people turn to TCP_NODELAY because of the “200ms” latency you might incur on a connection. Fun fact: that delay doesn’t come from Nagle’s algorithm itself, but from delayed ACKs (or corking) interacting with it. Yet people turn off Nagle’s algorithm instead… :sigh:

    Here’s the thing though, would you rather your user wait 200ms, or 40s to download a few megabytes on an otherwise gigabit connection?

    Follow along

    This isn’t the end of the journey. Follow this blog to get updates.
