Mutex without lock, Queue without push: cancel safety in lilos
- Levels of cancel safety
- What good is a queue that you can’t push into?
- What good is a mutex you can’t lock?
- Toward a more cancel-safe world
I’m trying to do something kind of unusual with lilos: in addition to almost all the APIs being safe in the Rust sense, I’m also attempting to create an entire system API that is cancel-safe. I’ve written a lot about Rust’s async feature and its notion of cancellation recently, such as my suggestion for reframing how we think about async/await.
My thoughts on this actually stem from my early work on lilos, where I started beating the drum of cancel-safety back in 2020. My notion of what it means to be cancel-safe has gotten more nuanced since then, and I’ve recently made the latest batch of changes to try to help applications built on lilos be more robust by default.
So, wanna nerd out about async API design and robustness? I know you do.
Levels of cancel safety
The concept of “cancel safety” is pretty recent, and (in my opinion) we don’t have great terms for talking about it. For my documentation, and for this post, I find it’s useful to distinguish between different levels of cancel safety in async APIs.
- Not cancel-safe. An API that generates a future that, if dropped before it resolves, will do something unpleasant or ill-defined, such as lose data, leave things in an inconsistent state, or panic. While panicking is not a safety issue from a memory-safety perspective, it is a cancel-safety issue, since it can cause a program that looks right to fail catastrophically later.
- Strictly cancel-safe. You can drop the future generated by a strictly cancel-safe API without ill effect (other than wasting some CPU time, perhaps). You can turn around and call the same operation again and get essentially the same result. This is the ideal, but it can be hard to reach in some cases.
- Weakly cancel-safe. The cancellation behavior for a future is well-defined, designed to be useful, and documented, but falls short of the “just drop it and retry” property required for strict cancel-safety. Think of this category as the border zone between the previous two. I’m trying to avoid building any APIs in this category, but there’s still one function in lilos like this.
- Vacuously cancel-safe. An operation technically meets the criteria for strict cancel-safety in isolation, but encourages uses that make it trivial to accidentally write cancel-unsafe code on top of it. These APIs are bug generators!
In lilos, the goal is to have all OS APIs be strictly cancel-safe, which I’ve nearly achieved (there’s still one weakly cancel-safe operation in an optional module). More importantly (to me, anyway), the intent is not to provide any “vacuously cancel-safe” API.
This has required some careful thought and design, because – to my knowledge – other async platforms aren’t attempting to do this. tokio is full of non-cancel-safe API, for instance. In some cases it offers a cancel-safe option alongside a cancel-unsafe one; in other cases (such as the async I/O facility) there’s no cancel-safe option available.
So, let’s look at two of the cases I ran into in lilos.
What good is a queue that you can’t push into?
lilos provides a single-producer, single-consumer queue in the spsc module. Its API has evolved considerably over the past four years, generally as a result of me shooting myself in the foot with it. The current API is pretty decent.
The queue uses a pattern that’s pretty common in Rust APIs designed to be used
without dynamic memory allocation: you create a queue, and then call a split()
operation to generate two endpoints. One endpoint can push, the other can pop.
Holding the push end of the queue only gives you the ability to put new things
into it, and holding the pop end only lets you take things out.
Here’s a silly example using the lilos 0.2-era API (for reasons that will become clear in a minute):

let mut q = Queue::new(&mut storage);
let (q_in, q_out) = q.split();
// Put something in
q_in.push(thing).await;
// ...and take it back out
assert_eq!(q_out.pop().await, thing);
That example is silly because the same function holds both the push and pop ends of the queue, but you get the rough idea.
During the lilos 0.3 series, however, I deprecated the push operation. (It will be removed entirely for 0.4.) That might seem strange – what good is a queue that you can’t push into?

The reason I’m removing push is that the operation is inherently not cancel-safe:
- Calling q_in.push(element) produces a future that takes ownership of element.
- When that future resolves, element is stored safely in the queue.
- However, if you were to drop the future before it resolves, element is lost!
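Here’s a minimal sketch of that failure mode (the names are illustrative, not real API beyond push itself):

// The future returned by `push` takes ownership of `element` as
// soon as it is created, before anything is awaited.
let fut = q_in.push(element);
// Dropping the future (as losing a select! or timeout race would)
// discards the element: it was never stored in the queue, and we
// no longer own it, so we can't even retry.
drop(fut);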
Instead, the input endpoint of the queue now has a reserve operation. You use reserve like this:

q_in.reserve().await.push(element);
Notice that there’s still a function here called push, but it now occurs after the await. That’s important. The future produced by reserve doesn’t take ownership of the element going into the queue, so you can drop it freely without any ill effect. When it resolves, it returns a type called Permit, which represents the right to push a single element into the queue.
You can choose to drop the Permit, which will just give up that right without changing the contents of the queue. Or, you can call the Permit::push operation, which moves ownership of the element into the queue, while simultaneously consuming the Permit so it can’t accidentally be used again.
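In code, the two outcomes look something like this (a sketch; element is again illustrative):

// Waiting for a Permit doesn't involve the element at all, so
// cancellation here is harmless: we still own `element` and can retry.
let permit = q_in.reserve().await;
// Option 1: give up the reservation; the queue is untouched.
drop(permit);
// Option 2: reserve again and move the element in. `push` is
// synchronous and consumes the Permit, so it can't be cancelled
// partway through, and the Permit can't be reused.
q_in.reserve().await.push(element);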
The reserve operation intrinsically meets the definition of strict cancel-safety, while the push operation intrinsically cannot. So, lilos will only offer reserve.
To Tokio’s credit, I borrowed the idea of a reserve operation from them. They have not deprecated their cancel-unsafe push operation yet, but perhaps they’re planning to.
What good is a mutex you can’t lock?
lilos also provides a mutex type, in the mutex module. It looks very much like std::sync::Mutex from the standard library. It contains the data that it guards (which all mutexes in Rust should do).

However, lilos::mutex::Mutex doesn’t have a lock operation by default.

The mutex in std has a lock operation that produces a MutexGuard, a “smart pointer” object that indicates that you’ve successfully locked the mutex, and grants access to the guarded data. You use the std mutex like this:
let mutex = Mutex::new(data);
let guard = mutex.lock().unwrap();
guard.stuff; // accesses the fields of the guarded data
// mutex is unlocked when guard is dropped
This is a very convenient design for std, where cancellation is not a thing. Async mutex APIs tend to follow the same pattern – for instance, the mutex in lilos 0.2 and earlier, or the Tokio mutex. In those cases the code looks very similar, but with an await when you lock. Easy enough.
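Roughly, the async flavor of the same snippet is:

let guard = mutex.lock().await;
guard.stuff; // same smart-pointer access to the guarded data
// mutex is unlocked when guard is dropped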
It’s straightforward to define an async mutex API that appears to meet the criteria for strict cancel-safety:
- The lock operation produces a future that resolves only when the mutex has been acquired.
- If you drop that future before or after it resolves, somebody else can lock the mutex instead.
- If you turn around and retry, you’ll get back into line without ill effect.
It turns out this API design is “vacuously cancel-safe,” because of the way
mutexes are typically used. It is very common for programs to temporarily
violate the invariants of data stored in a mutex, only to “clean up” after
themselves before unlocking the mutex. This is perfectly safe: because you’ve
locked the mutex, you know that no other thread/task/etc. can see the changes
you’re making, so it’s not critical to keep everything perfect at every step.
You just have to make sure things are neat and tidy before you unlock. (This is
very similar to the common pattern of temporarily violating invariants through a
Rust &mut
reference – since it’s an exclusive reference, you can do it
without messing anyone up.)
All this falls apart if other code can cause you to suddenly unlock the mutex without intending to, before you’re done with your updates. And that’s the situation you wind up in with async cancellation. Consider this code:
// invariant: mutex always contains Some(resource), never None.
let mutex = Mutex::new(Some(resource));
let mut g = mutex.lock().await;
// do stuff with the resource that requires taking it out of
// the mutex temporarily.
let resource = g.take().unwrap();
let new_resource = do_stuff_with(resource).await;
// put it back before unlocking.
*g = Some(new_resource);
The problem arises at the second await (when we call do_stuff_with). If we’re cancelled while parked at that await, we’ll unlock the mutex while its contents are invalid. The next time we call g.take().unwrap(), we’ll get a panic – so we’ll get a bug report pointing to that code, rather than the code that caused the problem by not maintaining the mutex contents correctly!
(You might note that this behavior is very similar to poisoning in the std
mutex, which happens if code panics while holding a mutex. Poisoning exists
specifically to prevent this problem of unlocking a mutex while a data
structure is in an invalid state. The tokio mutex doesn’t implement poisoning –
it just unlocks on panic. This is a design point that reasonable people can
differ on, but I consider it to be a mistake, personally. However, cancellation
safety of the mutex API is independent of whether or not the mutex implements
poisoning.)
One option for avoiding this is to never hold a mutex across an await point. However, you’ll have to check that rule in code review, rather than at compile time, which means you’ll miss it eventually.
To restore that compile time check, a Tokio application might wrap the mutex guard in a type that does not implement Send. This will prevent an async function from holding it across an await point … in Tokio. This is because Tokio requires that spawned futures implement Send. lilos, however, does not have any such requirement, so this trick wasn’t an option for me.
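For the curious, a rough sketch of that trick might look like this (NotSendGuard is hypothetical, not a real Tokio type):

use core::marker::PhantomData;
use tokio::sync::MutexGuard;

// Raw pointers are !Send, so PhantomData<*const ()> makes this
// wrapper !Send too. Any future holding it across an await point is
// then itself !Send, and tokio::spawn refuses it at compile time.
struct NotSendGuard<'a, T> {
    inner: MutexGuard<'a, T>,
    _not_send: PhantomData<*const ()>,
}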
I wound up deciding to do something slightly different. By default, the lilos Mutex type (in 0.3 and later) provides a perform operation instead of a lock operation. perform takes a closure to execute while the lock is held. You use it like this:

mutex.perform(|data| data.stuff()).await;
Critically, it’s a normal Rust closure – not an async block. This means it’s impossible to await during the perform, so cancellation can’t occur. If you attempt to await in the body of the closure, you’ll get a compile error.
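That restores the property we wanted from &mut-style exclusive access. A sketch, reusing the take-and-put-back shape of the earlier example (transform is a hypothetical synchronous function):

mutex.perform(|data| {
    // We can temporarily violate invariants in here, because
    // there's no way to await: nobody can observe the mess, and
    // cancellation can't strike until after we've cleaned up.
    let resource = data.take().unwrap();
    let new_resource = transform(resource); // must be synchronous
    *data = Some(new_resource);
}).await;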
This covers many use cases for mutexes, but not all. Sometimes you genuinely need to hold a mutex while await-ing some operation. (Using a mutex to share access to something like an I2C peripheral, for instance.)
lilos 0.3 provides an imperfect but useful escape hatch for these situations: it defines a wrapper called CancelSafe. CancelSafe is not magic; it’s defined like this:

pub struct CancelSafe<T>(pub T);
By wrapping your data in CancelSafe, you are asserting to the mutex that you will always keep it in a valid state at any possible cancellation point. The compiler can’t check this (yet), so you will need to take care to use it correctly.
But, if you create a Mutex<CancelSafe<T>> instead of a Mutex<T>, the lock operation will reappear. As before, the lock operation itself is cancel-safe, in that its future can be dropped without messing up your data structures – but the futures you build on top of that can easily mess up your data structures if they’re cancelled. By using CancelSafe you’re indicating that you’ve thought about this and are comfortable with the risk.
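Applied to the earlier Option example, disciplined usage might look something like this (a sketch; update_in_place is a hypothetical helper that borrows rather than takes):

// invariant: the Option is Some at every await point.
let mutex = Mutex::new(CancelSafe(Some(resource)));
let mut g = mutex.lock().await;
// Instead of take()-ing the resource out (which would leave None
// in the mutex across the await), borrow it in place. If we're
// cancelled at this await, the mutex still holds Some(resource).
update_in_place(g.0.as_mut().unwrap()).await;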
(The risk, incidentally, is not an unsafe-style memory safety risk – not unless you’re doing the data structure updates with unsafe. Generally speaking, messing up a data structure like this results in a panic down the road, not data corruption.)
Toward a more cancel-safe world
I expect these APIs will need revision, and I’m planning on adjusting them in the next compatibility-breaking lilos release (which’ll be 0.4 or, if I get ambitious, 1.0). At the very least, some of my name choices in the current APIs aren’t great, and CancelSafe is probably the wrong name for the wrapper. perform taking ownership of its closure is also questionable – it probably wants to return a permit-style token like the queue does.
But, at this point, the non-deprecated API in lilos is strictly cancel-safe except for the handoff module and the bit about perform taking ownership of its closure. The handoff module is hard to fix; I can’t apply the same technique I used on the queue, because handoff doesn’t have any backing storage. If I come up with something, I’ll definitely write it up!
As always, I’d welcome any feedback on these API designs (or their implementation). If you’ve got ideas for how to improve the ergonomics around cancel safety in lilos, drop me a line.
Update from a few days later: I’ve revamped the perform operation to not consume the closure until it’s sure to succeed, and renamed the lock operation to lock_assuming_cancel_safe – a much more visible scary name, in keeping with my broader theories about programmers preferring shorter names over longer ones by default.
I’ve also turned off the handoff module in the crate’s default features. This means that, as far as I know and barring any new bugs, the next major release of lilos should be entirely strictly-cancel-safe by default.