A Pythonista Debugs the Golang stdlib
December 19, 2021
After recounting this experience during a recent conversation, I realized I'd never written it down anywhere. My memory is only getting worse from here on in, so for the sake of posterity... one of my most fun deep-dive experiences so far: the time I found a bug in the Go standard library (and/or glibc, depending on who you ask).
A group of nervous developers gather in a conference room, anxiously watching a large-screen monitor. It's the third day of the quarterly DevJam, when all the fully remote developers come to HQ for a week of brainstorming, prioritizing, and tacos. Today we're demoing one of the more complex features we've been trying to pull off -- version control integration.
Good news! The first couple of commits and recoveries for the demo notebook go off without a hitch. Then, unexpectedly, an error pops up -- the Git repo server has crashed. "Oh, yeah," my colleague notes. "This is OK to demo, but it'll arbitrarily crash after it's up for a couple of minutes. We haven't gotten to the bottom of it yet."
I am intrigued, and volunteer to join my colleague in digging into the problem. It's clear right away from their initial investigation that only one service is crashing: the self-hosted Git repo server we're using for demos and proof-of-concept implementations with customers. We've been impressed with gitea so far, but this is a potential show-stopper.
Since it's the one service that's crashing, can we make it crash in isolation? Yes, it turns out we can: we can launch it on its own, connected only to our shared PostgreSQL database, and it will crash reliably after the first half dozen API calls. That's great -- now that we can reliably reproduce the problem by treating the service as a "black box", we can set up a simple test script and try to open the box up.
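The "simple test script" can be almost anything that pokes the API on a timer and notices when the server stops answering. A rough sketch in Go (the URL and endpoint here are placeholders, not our actual setup):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hammer a cheap read-only endpoint until the black box falls over.
	client := &http.Client{Timeout: 5 * time.Second}
	for i := 1; ; i++ {
		resp, err := client.Get("http://localhost:3000/api/v1/version")
		if err != nil {
			fmt.Printf("call %d failed: %v -- did the server just crash?\n", i, err)
			return
		}
		resp.Body.Close()
		fmt.Printf("call %d: HTTP %d\n", i, resp.StatusCode)
		time.Sleep(time.Second)
	}
}
```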
Our reproductions now produce the same traceback consistently, which is enough to open an issue upstream on the open-source project. The traceback is, frankly, intimidating: at the top of the call stack is a Linux system call, and between it and our code sit a lot of layers I don't have much experience with.
Looking down a few layers, though, the first public Go API function in this traceback is user.Current() -- per the documentation, this returns a structure full of user information, which lib/pq then uses to try and find a password file. Finding this out does solve our team's immediate problem: we can set the PGPASSFILE environment variable to skip this process in lib/pq, and pointing it at a dummy file doesn't break the network-based authentication mechanism we're currently using.
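To make the workaround concrete, here's roughly what it amounts to -- a sketch with placeholder paths and credentials; in practice we set PGPASSFILE in the service's deployment environment rather than in application code:

```go
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/lib/pq"
)

func main() {
	// Point lib/pq at a dummy password file so it never has to look up the
	// current user's home directory via user.Current().
	if err := os.WriteFile("/tmp/dummy-pgpass", nil, 0600); err != nil {
		log.Fatal(err)
	}
	os.Setenv("PGPASSFILE", "/tmp/dummy-pgpass")

	// Credentials still come from the connection string, so the existing
	// authentication setup is untouched.
	db, err := sql.Open("postgres", "host=db user=demo password=hunter2 dbname=gitea sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connected without touching user.Current()")
}
```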
We put the workaround in place, but I'm not satisfied, and want to dig a little further. Why is user.Current() causing a segmentation fault? Looking through bug trackers finally yields a workaround which leads to an answer:
- start with a statically linked binary (like our build of gitea)
- perform a network call in a child thread (which gitea does when initializing a PostgreSQL connection in a pool of goroutines) -- implicitly, this loads a shared C library to support the network call, and initializes thread-local storage in the child thread but not in the parent thread
- call user.Current() in the parent thread (as gitea may do when creating another PostgreSQL connection to scale up its connection pool) -- under the hood, user.Current() calls glibc's getpwuid_r from the parent thread, which causes a segmentation fault when that thread's thread-local storage hasn't been correctly initialized (sketched in code below)
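Translated into a minimal Go program, that recipe looks roughly like the sketch below. It's an illustration rather than a guaranteed reproducer: it assumes a cgo build statically linked against glibc (something along the lines of go build -ldflags '-linkmode external -extldflags "-static"'), and since goroutines aren't pinned to particular threads, it may take several runs to fault -- which lines up with the flakiness we saw.

```go
package main

import (
	"fmt"
	"net"
	"os/user"
)

func main() {
	done := make(chan struct{})
	go func() {
		// Network lookup on a child thread: in a statically linked glibc
		// binary this can dlopen NSS shared libraries, setting up their
		// thread-local storage on this thread but not on the parent thread.
		_, _ = net.LookupHost("example.com")
		close(done)
	}()
	<-done

	// Back on the original thread, user.Current() goes through getpwuid_r,
	// which touches thread-local storage that was never initialized here
	// and can therefore segfault.
	u, err := user.Current()
	fmt.Println(u, err)
}
```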
What an adventure! This neatly explains the behavior we're seeing -- failures after some unpredictable number of API calls, depending on timing within the Go runtime. The smart folks who figured out the Go standard library issue have already opened an upstream bug against glibc, so all that's left is to open a pull request mitigating the behavior in lib/pq so that future users run into it less often.
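I won't try to reconstruct the actual pull request from memory, but the general shape of such a mitigation is easy to sketch: exhaust cheaper sources for the password-file location before ever reaching os/user. The helper below is hypothetical, not lib/pq's real code.

```go
// Package pgpass sketches the shape of the mitigation -- not the real
// lib/pq change: prefer environment variables, and only fall back to
// user.Current() (and therefore getpwuid_r) as a last resort.
package pgpass

import (
	"os"
	"os/user"
	"path/filepath"
)

// passFilePath is a hypothetical helper returning the .pgpass location.
func passFilePath() (string, error) {
	if p := os.Getenv("PGPASSFILE"); p != "" {
		return p, nil
	}
	if home := os.Getenv("HOME"); home != "" {
		return filepath.Join(home, ".pgpass"), nil
	}
	u, err := user.Current() // the risky call, now rarely reached
	if err != nil {
		return "", err
	}
	return filepath.Join(u.HomeDir, ".pgpass"), nil
}
```

Time for tacos all around!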