Cautionary tale about using threads and fork()
I ran into an interesting problem: a process stuck on a mutex. This sounds like a common deadlock, but it wasn’t that simple. The thread that was stuck was the only thread in the process.
From the printouts I could clearly see that the mutex had been locked earlier by another thread that no longer existed. You can sometimes get into this kind of problem when you don’t handle exceptions properly. The code is written in C++.
Unfortunately for me, that mutex is always locked through a guard object: the guard locks the mutex in its constructor and unlocks it in its destructor. So an exception thrown while the mutex was locked couldn’t have caused the problem; when the guard goes out of scope, its destructor unlocks the mutex.
My first hunch was signals. A common pitfall with signals is locking a mutex and then receiving a signal; if the signal handler tries to lock the same mutex, you end up with a deadlock. Fortunately for me the process was still alive, so I was able to check the backtrace. It showed that the thread was not stuck in a signal handler.
And that was it. I had no second hunch. It took me a while to realize what had really happened.
This process is expected to run in the background, so the most natural thing for it to do was to fork() right after starting. A moment before that, it launched a thread to handle some asynchronous tasks. When a process forks, the child inherits memory and file descriptors from the parent. One thing it does not inherit is the parent’s threads.
In my particular case, the parent created a thread, and the thread started running and locked the mutex. At that exact moment the parent called fork(), and the child process was born with the mutex locked, with no thread left to ever unlock it.
To conclude, it never ceases to amaze me how simple and yet how complicated multi-threaded programming can be. As a precaution, try not to mix multi-process and multi-threaded designs in the same program. You may get surprising results.
The POSIX threads model isn’t a good one; it’s too low-level and error-prone. It just happens to be the one that’s standardised. The CSP model is a lot nicer, more akin to Unix pipes, but arranged as graphs rather than linear pipelines, and all within the same process space. Russ Cox has a nice page on the history of CSP-inspired languages at Bell Labs, https://swtch.com/~rsc/thread/ and you can see the latest popular descendant of these with Go, http://golang.org/
From one point of view I understand what you are saying. From another, after reviewing the threading models of Go and Erlang, I simply can’t take these languages seriously. It is very difficult to write anything serious in Go or Erlang.
I’ll give you an example of what I mean. I am working on a pager program, something similar to less. You can take a look at it here: http://github.com/dowel/hless.
One of the things I would like it to do is searching. Since I am trying to build something better than less, I would like to use multiple threads to search for a pattern. Can I do something like this in Go?
What I’d do in C++ using pthreads is figure out the number of processors in the machine; call it N. Then I’d spawn N threads, each searching for the pattern in its own portion of the file. I’d probably set their CPU affinity to make sure I am utilizing all CPUs at once.
Can I do anything like that in Go? Can I get anywhere near this level of control with Go?
It seems like a new trend. With Go you have goroutines. Python is completely dynamic. Both Erlang and Go are garbage collected. However, each of these features comes at a price, and most often you pay with performance. Today’s processors are indeed faster than processors ten years ago, but not dramatically so; the additional performance comes from optimization. Languages like Go give you extra convenience at the price of interfering with those optimizations.
Am I missing something?
Sure you can.
Well, perhaps I am not familiar enough with these languages.
I chanced upon your post after discovering the Solaris forkall(2)/forkallx(2) system calls.
The risk in trying to do something better is inadvertently making it worse. While modern OS schedulers aren’t perfect, is using sched_setaffinity a wise idea? In the common case where the cores are all otherwise idle, there’s no good reason to think the scheduler wouldn’t distribute the threads evenly. Binding threads would also prevent any potential benefit from NUMA-aware scheduling, which keeps a thread running on cores close to the memory pages it’s accessing. Yes, I know the Linux scheduler is far from ideal, but is it so bad that manually binding each thread to a particular CPU is worth it? I’m also not sure how Linux handles a thread bound to a CPU that has been suspended or shut down, as Android does for power management. And is it really necessary to search so thoroughly? On many occasions I’ve used less with a file far larger than the physical RAM in the machine (for example, a hard drive image), and such a divide-and-conquer parallel search strategy may have thrashed the VM and wasted a lot of CPU time looking for extra results that may not be wanted.
But, since it’s an interactive application, it would be nice if it were to continue searching for a second match while it shows me the first match. That way, I don’t have to wait as long after pressing n.
Perhaps using the SSE4.2 instructions for optimized string searching would be of more value?
Again, just some thoughts and not a critique. Thank you for your blog posts.