Safely writing code that isn't thread-safe
An under-appreciated Rust feature
- The bank account example
- When threads attack!
- When locks attack!
- Head-in-the-sand concurrency
- The bank account example, in Rust.
- The engineering marvel of Sync (and Send)
- The joys of not being thread-safe
- Explicitly opting out of Sync
- The other concurrency trait: Send and when to avoid it
- Conclusion
- Appendix: the relationship between Sync and Send
- Appendix: opting back in
One of the nice things about the Rust programming language is that it makes it easier to write correct concurrent (e.g. threaded) programs – to the degree that Rust’s slogan has been, at times, “fearless concurrency.”
But I’d like to tell you about the other side of Rust, which I think is under-appreciated. Rust enables you to write programs that are not concurrent. This ability is missing from most other languages, and its absence is a source of much complexity and bugs.
“But wait,” you might be saying, “of course I can write code that isn’t concurrent in Java or Python or C!”
Can you, though? You can certainly write code that ignores concurrency, and would malfunction if (say) used from multiple threads simultaneously. But that’s not the same thing as writing code that isn’t concurrent – code that simply can’t be used concurrently, by compiler guarantee.
In Rust, you can. Let’s look at why you can do it, and why it’s awesome.
The bank account example
Here’s a very common example of a data structure that trips people up in the presence of threads. It’s a model of a bank account.
I’ve written this in C for reasons that will become apparent later. Please never use a `float` variable to store an amount of money; this is just an example.
#include <stdbool.h>

/*
 * Stores information about a bank account.
 */
struct BankAccount {
    float balance;
};

/*
 * Deposits 'amt' into the 'account' and returns the new balance.
 */
float deposit(struct BankAccount *account, float amt) {
    account->balance += amt;
    return account->balance;
}

/*
 * Tries to withdraw 'amt' from the account. Returns 'true' if it
 * succeeded, 'false' if the account didn't contain enough.
 */
bool withdraw(struct BankAccount *account, float amt) {
    if (account->balance < amt) {
        return false;
    }
    account->balance -= amt;
    return true;
}
This is just about the simplest way you could write this in C – it would look very similar in Java. Those of you with thread-related scars have probably already noticed the issue: in the presence of concurrency, the bank account update operations `deposit` and `withdraw` are not necessarily atomic.
This means that, if two threads are accessing the same account, you could end up with the following interleaving of operations:
- Thread A: begins depositing 100.
- Thread B: begins withdrawing 100.
- Thread A: reads the current account balance to increment it – let’s say it started out at 100.
- Thread B: also reads the current account balance, and decrements it. Why did Thread B get to do a larger chunk of work here? Who can say? Maybe the computer is mad at you.
- Thread A: stores the balance after the deposit: 200.
Thread B thinks it successfully withdrew 100 moneys, while Thread A thinks it successfully deposited 100 moneys, which should leave the account balance unchanged… but it actually increased by 100 moneys. This is certainly not what the bank wants – and it’s also not what we, the programmers, intended!
When threads attack!
The bank account code was not written to be used concurrently. If it had been, it would have been written differently. But nothing prevents someone from using that code in a larger program with threads. This means that the code can be the victim of nonconsensual concurrency.
Nonconsensual concurrency is when your code has concurrency forced upon it. Never mind that the code was not written with threads or other concurrency in mind – you’re getting threads whether you like it or not!
This sounds unpleasant, but unfortunately it’s very common. In my experience, it’s the main way programmers learn about data races and other concurrency-related bugs in the first place: the hard way. By noticing (or, worse, having someone else notice!) that their code is misbehaving. The past several decades have been a story of one layer after another being used from threads without being designed for it, getting bitten by the bugs, making the repairs, and repeating – starting from the fundamental `libc` and friends and working up.
The awful thing about nonconsensual concurrency – other than the part where it removes your agency as programmer, of course – is that it tends to be fixed in a hurry, after the fact. If you’ve got an existing chunk of code that worked fine before those darned threads showed up, you’re not likely to sit down and redesign all your data structures and algorithms from scratch to use fancy lock-free high-concurrency alternatives. No, if you’re anything like most programmers (myself included) you’re more likely to just slap some locks on things and hope it starts working again.
And in a lot of ways, this is a reasonable response! We’ve all got deadlines and other stuff to do, and it’s not like we invited that bozo over there to start using threads in our code. We know “slap a big lock on it” will work because it’s been done a gazillion times.
Right?
When locks attack!
As attractive as “hit it with a lock” may be, there are some issues.
- Scope. It’s very important to make sure that all fields being guarded by a lock are only accessed with the lock held. If any accesses happen without coordinating through the lock, then you haven’t fixed the bug, only made it harder to find[1]. This mistake is very easy to make when retrofitting locks onto library code, like when defending against nonconsensual concurrency.
- Granularity. Locking and unlocking are not free. For best performance using locks, it’s important that the code do as much work as possible with the lock held, rather than repeatedly locking and unlocking it. This also means that nesting components that each have their own lock won’t be great either, since you’ll have to lock a series of locks on the way in, and then unlock them all on the way out.
- Liveness. While a lock is locked, no other thread can get to the chunks of code/data it protects – that’s what it’s for. However, this can also bottleneck a program. Imagine that we protect all accounts at the bank with a single lock. Now, no matter how many CPUs you put in your fancy bank computer, you’re only ever going to be processing one account operation at a time, leaving most of those CPUs idle! (Does it sound like this is in direct conflict with the Granularity point above? That’s because it is!)
- Built-in overhead. Because locks are not free, your code is going to take some performance hit for using them. If your users are not using your code from multiple threads, this is simply a waste.
In Rust, locks (like `Mutex` or, in some cases, `RefCell`) contain the data they protect, which ensures that all accesses happen through the lock. This is important enough that I wrote a separate article on it. I’ll return to the topic below.
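To make that first point concrete, here’s a minimal sketch of the Rust approach: because the balance lives inside the `Mutex`, the only way to reach it is through the guard returned by `lock`.

use std::sync::Mutex;

fn main() {
    let balance = Mutex::new(100.0_f32);

    {
        // Accessing the data *requires* holding the lock.
        let mut b = balance.lock().unwrap();
        *b += 100.0; // a deposit, performed with the lock provably held
    } // the guard is dropped here, releasing the lock
}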
Those points hold across basically all mainstream programming languages, but there’s another one that is more subtle: assuming a concurrency scheme.
- Mutexes (a common option for the type of lock used here) tend to require operating system support. Using an OS-specific version like `pthread_mutex` will prevent your code from running on other operating systems. And using any flavor will restrict your code to running on an operating system that provides mutexes – meaning that your code likely can’t be used in embedded contexts or inside the implementation of an OS kernel. (For our bank account example, that may be less of a concern, unless you’re writing some kind of BankOS, I guess.)
- It is entirely possible to have concurrency without threads. You might have noticed that I’m using both of those terms in this article; threads are the most common form of concurrency that programmers encounter these days, but there are other options. Basically all languages except C(++) have an `async` mode that transforms code into state machines that can be interleaved on a single thread of execution. In `async` mode, you generally need to be using a different kind of lock to keep from blocking execution of other `async` routines on the same thread. If a lock is unavailable, you want to switch to a different `async` routine, not stop and wait! (A sketch of this follows the list.)
- Finally, while rare nowadays, it’s possible that your user doesn’t have threads, or the equipment that comes along with threads, like mutexes. This is most common in bare-metal embedded contexts, but is also true when developing on Hubris, where each task is isolated and deliberately single-threaded. If you use a mutex in your code, you’ve prevented your code from being used in these contexts.
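As a sketch of that `async` point – assuming the tokio crate, whose `Mutex::lock` is itself `async`:

use tokio::sync::Mutex;

// Locking awaits instead of blocking: if the lock is held, this task is
// suspended and other async routines on the same thread keep running.
async fn deposit_async(balance: &Mutex<f32>, amt: f32) -> f32 {
    let mut b = balance.lock().await;
    *b += amt;
    *b
}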
Head-in-the-sand concurrency
There’s another option, of course. If we’re concerned about the performance cost of managing locks or using atomic data types, we could just not.
This approach is quite common, particularly (in my experience) in C code. The C standard library has a bunch of examples of this pattern for historical reasons. Some of these functions are not only unsafe to use from threads, they may not even work if you call them at two levels in a nested `for` loop, which sure stretches the definition of “concurrency!” (`strtok` is a good example of this.)
On the one hand, we might be tempted to scoff at such APIs as relics of our pre-thread past. But I think that’s being too quick to judge – there are often very good reasons for writing an API that isn’t thread-safe, starting with simplicity. Our bank account example is so simple that a reader can easily convince themselves that it’s correct… in the absence of concurrency.
No, I would argue that the problem with non-reentrant code in C is not that it’s not reentrant – it’s that there is no way to prevent it from being misused, in a recursive function or concurrent program.
No way other than this, of course:
/*
* WARNING: THIS API IS NOT THREAD SAFE
*
* Do not call it in a threaded context.
*
* Please add this function to the mental list of hundreds
* of things that you have to manually look for in code
* review to prevent your software from shipping with bugs.
*
* If you forget about this, probably nothing will happen
* right away, but in about six months you will suddenly be
* swallowed by a sinkhole that opens in the middle of the
* street, or something to that effect.
*
* Memento mori.
*/
Because of course everyone will read this comment and remember it!
As I put it in my article on the Rust mutex type: comments are not a concurrency strategy.
If only there were a way to keep our code simple, but make it hard to misuse…
The bank account example, in Rust.
Let’s revisit that bank account example, but this time, let’s do it in Rust:
// Remember, this is merely an example; don't use f32 for currency.

/// Stores information about a bank account.
struct BankAccount {
    balance: f32,
}

/// Deposits 'amt' into the 'account' and returns the new balance.
fn deposit(account: &mut BankAccount, amt: f32) -> f32 {
    account.balance += amt;
    account.balance
}

/// Tries to withdraw 'amt' from the account. Returns 'true' if it
/// succeeded, 'false' if the account didn't contain enough.
fn withdraw(account: &mut BankAccount, amt: f32) -> bool {
    if account.balance < amt {
        return false;
    }
    account.balance -= amt;
    true
}
This is the exact same code with the syntax changed to appease a different compiler. It sure looks like it has the same bugs as the C version, doesn’t it? For instance, `withdraw` is checking `account.balance` and then decrementing it, which seems like it has the potential for a race!

However, note that making any changes to the account (as in `deposit` or `withdraw`) requires an exclusive reference (`&mut BankAccount`). Exclusive references are unique in Rust. This means that a `BankAccount` can’t be altered by two concurrent routines, because they can’t both hold an exclusive reference.
In other words, when written the simple way, the bank account example in Rust does not have a concurrency bug.
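If you doubt this, try it – the borrow checker rejects any program that attempts to hold two exclusive references at once. This sketch deliberately fails to compile:

let mut account = BankAccount { balance: 100. };

let first = &mut account;
let second = &mut account; // error[E0499]: cannot borrow `account`
                           // as mutable more than once at a time
deposit(first, 100.);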
But let’s say that we want account modifications through a shared reference (`&BankAccount`) to work. This can be useful even without concurrency – maybe we’ve found ourselves in a case where we need to share `BankAccount`s between different data structures. The easiest way to make this work is with a `Cell` to enable mutation through a shared reference:
use std::cell::Cell;

struct BankAccount {
    balance: Cell<f32>,
}

fn deposit(account: &BankAccount, amt: f32) -> f32 {
    let new_balance = account.balance.get() + amt;
    account.balance.set(new_balance);
    new_balance
}

fn withdraw(account: &BankAccount, amt: f32) -> bool {
    let balance = account.balance.get();
    if balance < amt {
        return false;
    }
    account.balance.set(balance - amt);
    true
}
Okay, surely this has a concurrency bug, right? In a way, `Cell` makes it more obvious that these operations are not atomic! Look at `deposit`, where we `get()` the old balance, do some work, and then `set()` it – what if two threads do this simultaneously? Bug, right?
Let’s see if we can demonstrate the problem by spinning up two threads that fight over the account (using scoped threads to do this conveniently):
// Scoped threads from the crossbeam crate let the threads borrow `account`.
use crossbeam::thread;

fn main() {
    let account = BankAccount {
        balance: Cell::new(100.),
    };

    thread::scope(|s| {
        s.spawn(|_| {
            deposit(&account, 100.);
        });
        s.spawn(|_| {
            withdraw(&account, 100.);
        });
    }).unwrap();
}
That ought to do it! Now when we run it, we will…
…huh. We can’t actually run it, because the compiler is upset.
error[E0277]: `Cell<f32>` cannot be shared between threads safely
--> src/lib.rs:33:11
|
33 | s.spawn(|_| {
| ^^^^^ `Cell<f32>` cannot be shared between
| threads safely
|
= help: within `BankAccount`, the trait `Sync` is not implemented
for `Cell<f32>`
note: required because it appears within the type `BankAccount`
--> src/lib.rs:5:8
|
5 | struct BankAccount {
| ^^^^^^^^^^^
= note: required because of the requirements on the impl of `Send`
for `&BankAccount`
= note: required because it appears within the type
`[closure@src/lib.rs:33:17: 38:10]`
What the compiler is saying is that, while we have successfully altered `deposit` and `withdraw` to operate on shared references, we still can’t actually use the `BankAccount` type from multiple threads (“cannot be shared between threads safely”). It seems to be related to our use of `Cell`.
This is important.
The engineering marvel of Sync (and Send)
This is the under-appreciated Rust feature I’ve been alluding to.
Unlike most programming languages, where any data structure is assumed to be safe to share across threads by default, Rust makes this part of the type system. A value is safe to share across threads only if its type implements the `Sync` trait.
`Sync` is a marker trait, meaning that unlike traits like `Clone`, it doesn’t define any operations you can perform on the type. It’s empty. Its declaration would resemble (simplified):
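pub unsafe auto trait Sync {
    // Empty! No methods, no associated items. It's purely a marker.
}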
It exists to serve as a marker for types that can be shared across threads. Any type that does not carry this marker can’t be shared across threads.
This seems like a small thing, but it has big implications.
Rust assumes that any data structure whose contents are entirely `Sync` is safe to share across threads, and automatically implements `Sync` for the type. If anything inside the data structure is not `Sync`, neither is its container. (There is also a way to opt out of this behavior, which I’ll discuss below.)
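For example, here’s a sketch of how the inference plays out, using two made-up types:

use std::cell::Cell;

// Both fields are Sync, so the compiler infers `Config: Sync`.
struct Config {
    retries: u32,
    name: String,
}

// `Cell<u64>` is not Sync, so `Counter` isn't either.
struct Counter {
    count: Cell<u64>,
}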
The `Cell` type we used to implement the bank account above is not `Sync`, so it cannot be shared across threads. The compiler will prevent us from sending a reference to a `Cell` to a different thread, and will also stop us from stashing one in a place where another thread might have implicit access to it, like a `static` variable.
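For instance, a `static` must be `Sync`, because every thread can see it – so a sketch like this is rejected with the same E0277 error we saw above:

// error[E0277]: `Cell<f32>` cannot be shared between threads safely
static THE_ACCOUNT: BankAccount = BankAccount {
    balance: Cell::new(0.0),
};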
So while our code was “correct” in the absence of concurrency in both C and Rust, in Rust the code is also correct with no caveats, because we have written it in a way that prevents it from being used in situations where race conditions can happen.
The joys of not being thread-safe
By avoiding `Sync` we can write a whole family of related types that can’t be shared across threads. This does not prevent them from being used in threaded programs! It just prevents the concurrency from leaking into our code. Our code can be read top-to-bottom as though threads don’t exist, which makes it much easier to understand and check for mistakes. We can write simpler and more concise code by using basic (but not thread-safe!) types like `u8` and `Vec`, instead of `AtomicU8` and some kind of fancy concurrent vector with lock-striping or something.
In addition to less complexity for the reader to deal with, we also create less complexity for the machine to deal with. Atomic operations are almost always more expensive (in terms of execution time and code space) than their simpler non-atomic equivalents.
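As a sketch of that cost difference, compare a thread-safe counter with a plain one:

use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

// Thread-safe: every increment is an atomic read-modify-write that the
// CPU must coordinate across cores.
fn count_shared(counter: &AtomicU64) {
    counter.fetch_add(1, Ordering::Relaxed);
}

// Not thread-safe (and not Sync): a plain load, add, and store, which
// the compiler is free to optimize.
fn count_local(counter: &Cell<u64>) {
    counter.set(counter.get() + 1);
}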
If we (or our clients) want to use this code in a concurrent program, that program gets to choose our concurrency strategy, instead of having it dictated by our library code. It might use a big mutex on everything, or mutexes on each account, or have threads own a subset of data and talk to each other over messages on channels. It might even use concurrency without threads using `async` – because our code does not `await`, we can be confident that each `deposit` or `withdraw` operation will execute atomically to completion even in an `async` environment.
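For example, here’s a sketch of a client that chose the mutex-per-account strategy. Note that `Mutex<BankAccount>` is `Sync` even though `BankAccount` is not (the contents only need to be `Send`), so the client can share the locked account freely:

use std::sync::Mutex;
use std::thread;

fn client_code(account: &Mutex<BankAccount>) {
    thread::scope(|s| {
        s.spawn(|| {
            // The guard dereferences to the account, so our simple,
            // non-thread-safe API works fine with the lock held.
            let guard = account.lock().unwrap();
            deposit(&guard, 100.0);
        });
    });
}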
Don’t get me wrong: there’s still a place for concurrent data structures, particularly in highly parallel applications with unpredictable data access patterns. But sometimes it’s nice to be able to write the simple code, and not worry about threads ruining your day.
Explicitly opting out of Sync
As I mentioned above, Rust automatically infers `Sync` for types whose contents are entirely `Sync` themselves. This is often what you want, but it can also be a bad thing, for two reasons:
- It’s implicit. Someone reading the code cannot easily tell if a type is inferred to be `Sync`. More importantly, they can’t tell if it’s inferred to be `Sync` because the author intended it, or by accident! Why would it be bad to infer `Sync` by accident? Because…
- A structure may consist of all-`Sync` members, but not be thread-safe!
That second reason is rare, but it does happen. It’s usually because you have several members in (for example) a `struct`, each of which is individually `Sync`, but you want to be able to update them all without possibly exposing intermediate states before the update is done.
In both of these cases, you can explicitly opt your type out of `Sync`. There’s a proposed syntax for making this really convenient, called “negative impls,” but as of right now (November 2022) it is still considered unstable.
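On a nightly toolchain, that unstable syntax would look something like this (using a made-up type):

#![feature(negative_impls)]

struct BankBranch {
    // ...fields that all happen to be Sync...
}

// Explicitly promise that BankBranch will never be Sync.
impl !Sync for BankBranch {}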
So instead, we can use a trick: we sneak a non-`Sync` type into our data structure. Any one will do. You could simply stuff a `Cell` or a raw pointer in there. However, we’d really like to not spend the extra space for the `Cell`. We want the effect of putting a non-`Sync` member in our struct, say, but not the physical effects of doing so… like it’s a phantom.

Sounds a lot like `PhantomData` from the standard library!
I have a type like this in my codebase that I use to opt types out of automatic `Sync`:

use core::cell::Cell;
use core::marker::PhantomData;

/// Zero-sized marker type for opting a containing type out of Sync.
/// PhantomData<Cell<()>> is not Sync, because Cell is not Sync, but it
/// takes up no space and imposes no runtime cost.
struct NotSyncMarker(PhantomData<Cell<()>>);
I can then stuff it into some structure like so:
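struct BankBranch {
    // (Any fields will do; these are just for illustration.)
    accounts: Vec<BankAccount>,
    _not_sync: NotSyncMarker,
}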
This makes my `struct` act like it contains a `Cell`, which (as we saw above) is not `Sync`, but doesn’t spend any memory to actually store a `Cell`, because it’s a phantom.
I’m a little surprised that Rust doesn’t ship a zero-sized marker type[2] for this in `std::marker`, like they do for `PhantomPinned`, but fortunately it’s straightforward to write your own if you need it.
[2] Rust uses the term “marker” for both traits and types. The idea is this: both marker traits and marker types mark something as having a particular property, without adding anything – operations or data. A marker trait affects a type, while a marker type affects a value.
The other concurrency trait: Send and when to avoid it
I’ve focused on `Sync` above because `Sync` embodies most languages’ notion of what “thread safe” means: data that can be accessed and updated from multiple threads without corrupting things.
However, there’s another kind of safety in the presence of threads, a kind of safety that is more subtle and less common, but critical in the rare cases where you need it.
Consider this: many operating systems keep track of which thread (if any) has locked a mutex. If the data guarded by the mutex is accessed by any other thread, even if it’s correctly locked throughout the access, trouble can occur. For instance, many operating systems use mutex ownership information to implement thread priority management strategies to avoid priority inversion, or to keep track of which mutexes are locked to avoid deadlock. If you confuse the operating system about which thread is responsible for accessing the data guarded by the mutex, you defeat these mechanisms, and you’re headed to Bug Town.
In Rust, you can only refer to the data guarded by a `Mutex` through the mutex guard returned by the `Mutex::lock` operation. (If this is unfamiliar to you, I recommend reading my article on it; this section will make more sense.) This means that, to produce a program where one thread locks a mutex, and then a different thread accesses the guarded data, we would need to arrange for that second thread to have access to the mutex guard. If we can do that, we can produce programs that defeat those operating system protections I described above, and bad things will happen.
Let’s make bad things happen, shall we?
use std::sync::{mpsc, Mutex};
use std::thread;

struct Foobar(u32);

fn main() {
    let mutex = Mutex::new(Foobar(1234));
    // A channel for smuggling the guard to another thread.
    let (guard_in, guard_out) = mpsc::channel();

    thread::scope(|s| {
        s.spawn(move || {
            // On the first thread, we will become the owner of the mutex.
            let g = mutex.lock().unwrap();
            // Send it to our collaborator.
            guard_in.send(g).unwrap();
        });
        s.spawn(move || {
            // On the second thread, receive the guard and poke at the
            // data it protects.
            let g = guard_out.recv().unwrap();
            println!("{}", g.0);
        });
    });
}
That should do it! Now, when we run this, we see…
…oh, the compiler is saying something:
error[E0277]: `MutexGuard<'_, Foobar>` cannot be sent between threads safely
--> src/main.rs:12:17
|
12 | s.spawn(move || {
| ___________-----_^
| | |
| | required by a bound introduced by this call
13 | | // On the first thread, we will become the owner of the mutex.
14 | | let g = mutex.lock().unwrap();
15 | | // Send it to our collaborator.
16 | | guard_in.send(g).unwrap();
17 | | });
| |_________^ `MutexGuard<'_, Foobar>` cannot be sent between threads safely
|
= help: the trait `Send` is not implemented for `MutexGuard<'_, Foobar>`
That’s very similar to the error we got before when we tried to share data between threads, but the words are slightly different: instead of “cannot be shared between threads safely,” it says “cannot be sent between threads safely.”
We’ve hit the other concurrency-related trait in the Rust standard library: `Send`. A type implements `Send` if values of that type are safe to send between threads. It so happens that, for the reasons described above, mutex guards do not implement `Send`, and so our bug is avoided.
Like `Sync`, `Send` is inferred automatically by the compiler. In my experience, it is far less common to opt out of `Send` than to opt out of `Sync`. Here are some cases where you might want to opt out:
- If you’re implementing mutexes.
- If your type acts as a handle to some sort of thread-local data, which would be meaningless if accessed from another thread.
- If you’re doing some sort of logging or tracing that includes thread names, and you’d like your trace spans to indicate consistent thread names.
Rust does not ship a marker type for “not `Send`” in the standard library as of this writing, so until `negative_impls` are stabilized, you can opt out of `Send` explicitly the same way we opted out of `Sync`: by pretending to include some basic type that is not `Send`.
Raw pointers happen to not implement `Send`, so we can do this:

use core::marker::PhantomData;

/// Zero-sized marker type for opting a containing type out of Send.
/// Raw pointers do not implement Send, so neither does this marker.
struct NotSendMarker(PhantomData<*const ()>);
Conclusion
Rust is the only language where I regularly write non-thread-safe data structures without caveats or fear. In cases where they apply, such data structures can be simpler and often faster than concurrent thread-safe data structures – with the drawback, of course, that you can’t use threads to speed up operations on them.
If you’re coming to Rust from another language that emphasizes concurrency, such as Java or Go or (increasingly) C, keep this in mind:
- You don’t necessarily need to make every data structure thread-safe, and
- If you get it wrong, you’ll get a compiler error, not a weird runtime bug.
Appendix: the relationship between Sync and Send
`Sync` and `Send` interact in ways that make things much more convenient for us programmers. None of this is language magic – it’s just trait `impl`s in the standard library.
- If `T` is `Sync`, then `&T` is `Send`. This makes sense: to share a value across threads, each thread needs a reference to it, so we need to be able to send that reference.
- If `T` is `Send`, then `&mut T` is also `Send` – meaning that if you can transfer ownership across threads, you can also lend across threads, so long as you lose access for the duration. (Most cases where this works will also require the reference to be `'static`.)
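In the standard library, those two rules are written (approximately) as:

// A shared reference can be sent across threads only if the referent
// can be shared:
unsafe impl<'a, T: Sync + ?Sized> Send for &'a T {}

// An exclusive reference can be sent across threads only if the
// referent can be sent:
unsafe impl<'a, T: Send + ?Sized> Send for &'a mut T {}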
Appendix: opting back in
If you’ve written a data structure that is carefully engineered to be thread safe (in either the `Sync` or `Send` sense), it may still contain types that are themselves not `Sync` or `Send`, and so the compiler will not automatically infer an impl. This is somewhat rare, but can happen if you’re building a shared mutex-like data structure using `std::cell::UnsafeCell`. In these cases you may need to explicitly implement `Sync` and/or `Send`.
Because implementing these traits incorrectly would open up your program to data races and ghosts, implementing them explicitly is `unsafe`. So, you would do it like this:

// (MySpinLock is a stand-in for your carefully engineered type.)
// We, not the compiler, are promising that this is actually safe.
unsafe impl<T: Send> Sync for MySpinLock<T> {}
unsafe impl<T: Send> Send for MySpinLock<T> {}
Think very carefully before doing this.