- Levels of cancel safety
- What good is a queue that you can’t push into?
- What good is a mutex you can’t lock?
- Toward a more cancel-safe world
I’m trying to do something kind of unusual with
lilos: in addition to almost
all the APIs being safe-in-the-Rust sense, I’m also attempting to create an
entire system API that is cancel-safe. I’ve written a lot about Rust’s
feature and its notion of cancellation recently, such as my suggestion for
reframing how we think about
My thoughts on this actually stem from my early work on
lilos, where I started
beating the drum of cancel-safety back in 2020. My notion
of what it means to be cancel-safe has gotten more nuanced since then, and I’ve
recently made the latest batch of changes to try to help applications built on
lilos be more robust by default.
So, wanna nerd out about
async API design and robustness? I know you do.
Levels of cancel safety
The concept of “cancel safety” is pretty recent, and (in my opinion) we don’t
have great terms for talking about it. For my documentation, and for this post,
I find it’s useful to distinguish between different levels of cancel safety in
Not cancel-safe. An API that generates a future that, if dropped before it resolves, will do something unpleasant or ill-defined, such as lose data, leave things in an inconsistent state, or panic. While panicking is not a safety issue from a memory safety perspective, it is a cancel-safety issue since it can cause a program that looks right to fail catastrophically later.
Strictly cancel-safe. You can drop the future generated by a strictly cancel-safe API without ill effect (other than wasting some CPU time, perhaps). You can turn around and call the same operation again and get essentially the same result. This is the ideal, but can be hard to reach in some cases.
Weakly cancel-safe. The cancellation behavior for a future is well-defined, designed to be useful, and documented, but falls short of the “just drop it and retry” property required for strict cancel-safety. Think of this category as the border zone between the previous two. I’m trying to avoid building any APIs in this category, but there’s still one function in
Vacuously cancel-safe. An operation technically meets the criteria for strict cancel-safety in isolation, but encourages uses that make it trivial to accidentally write cancel-unsafe code on top of it. These APIs are bug generators!
lilos, the goal is to have all OS API be strictly cancel-safe, which I’ve
nearly achieved (there’s still one weakly cancel-safe operation in an optional
module). More importantly (to me, anyway) the intent is not to provide any
“vacuously cancel-safe” API.
This has required some careful thought and design, because – to my knowledge –
async platforms aren’t attempting to do this.
tokio is full of
non-cancel-safe API, for instance. In some cases it offers a cancel-safe
option alongside an unsafe option; in other cases (such as the async I/O
facility) there’s no safe option available.
So, let’s look at two of the cases I ran into in
What good is a queue that you can’t push into?
lilos provides a single-producer, single-consumer queue in the
module. Its API has evolved considerably over the past four years, generally as
a result of me shooting myself in the foot with it. The current API is pretty
The queue uses a pattern that’s pretty common in Rust APIs designed to be used
without dynamic memory allocation: you create a queue, and then call a
operation to generate two endpoints. One endpoint can push, the other can pop.
Holding the push end of the queue only gives you the ability to put new things
into it, and holding the pop end only lets you take things out.
Here’s a silly example using the
lilos 0.2-era API (for reasons that will
become clear in a minute):
let mut q = new;
let = q.split;
// Put something in
// ...and take it back out
That example is silly because the same function holds both the push and pop ends of the queue, but, you get the rough idea.
lilos 0.3 series, however, I deprecated the
push operation. (It
will be removed entirely for 0.4.) That might seem strange – what good is a
queue that you can’t push into?
The reason I’m removing
push is that the operation is inherently not
q_in.push(element)produces a future that takes ownership of
- When that future resolves,
elementis stored safely in the queue.
- However, if you were to drop the future before it resolves,
Instead, the input endpoint of the queue now has a
reserve like this:
Notice that there’s still a function here called
push, but it now occurs
await. That’s important. The future produced by
take ownership of the element going into the queue, so you can drop it freely
without any ill effect. When it resolves, it returns a type called
which represents the right to push a single element into the queue.
You can choose to drop the
Permit, which will just give up that right without
changing the contents of the queue. Or, you can call the
operation, which moves ownership of the element into the queue, while
simultaneously consumimg the
Permit so it can’t accidentally be used again.
reserve operation intrinsically meets the definition of strict
cancel-safety, while the
push operation intrinsically cannot. So,
To Tokio’s credit, I borrowed the idea of a
reserve operation from them. They
have not deprecated their cancel-unsafe
push operation yet, but perhaps
they’re planning to.
What good is a mutex you can’t lock?
lilos::mutex::Mutex doesn’t have a
lock operation by default.
The mutex in
std has a
lock operation that produces a
MutexGuard, a “smart
pointer” object that indicates that you’ve successfully locked the mutex, and
grants access to the guarded data. You use the
std mutex like this:
let mutex = new;
let guard = mutex.lock;
guard.stuff; // accesses the fields of the guarded data
// mutex is unlocked when guard is dropped
This is a very convenient design for
std, where cancellation is not a thing.
Async mutex APIs tend to follow the same pattern – for instance, the mutex in
lilos 0.2 and earlier, or the Tokio mutex. In those cases the code looks very
similar, but with an
await when you lock. Easy enough.
It’s straightforward to define an async mutex API that appears to meet the criteria for strict cancel-safety:
lockoperation produces a future that resolves only when the mutex has been acquired.
- If you drop that future before or after it resolves, somebody else can lock the mutex instead.
- If you turn around and retry, you’ll get back into line without ill effect.
It turns out this API design is “vacuously cancel-safe,” because of the way
mutexes are typically used. It is very common for programs to temporarily
violate the invariants of data stored in a mutex, only to “clean up” after
themselves before unlocking the mutex. This is perfectly safe: because you’ve
locked the mutex, you know that no other thread/task/etc. can see the changes
you’re making, so it’s not critical to keep everything perfect at every step.
You just have to make sure things are neat and tidy before you unlock. (This is
very similar to the common pattern of temporarily violating invariants through a
&mut reference – since it’s an exclusive reference, you can do it
without messing anyone up.)
All this falls apart if other code can cause you to suddenly unlock the mutex without intending to, before you’re done with your updates. And that’s the situation you wind up in with async cancellation. Consider this code:
// invariant: mutex always contains Some(resource), never None.
let mutex = new;
let g = mutex.lock.await;
// do stuff with the resource that requires taking it out of
// the mutex temporarily.
let resource = g.take.unwrap;
let new_resource = do_stuff_with.await;
// put it back before unlocking.
*g = Some;
The problem arises at the second
await (when we call
we’re cancelled while parked at that
await, we’ll unlock the mutex while its
contents are invalid. The next time we call
g.take().unwrap(), we’ll get a
panic – so we’ll get a bug report pointing to that code, rather than the code
that caused the problem by not maintaining the mutex contents correctly!
(You might note that this behavior is very similar to poisoning in the
mutex, which happens if code panics while holding a mutex. Poisoning exists
specifically to prevent this problem of unlocking a mutex while a data
structure is in an invalid state. The tokio mutex doesn’t implement poisoning –
it just unlocks on panic. This is a design point that reasonable people can
differ on, but I consider it to be a mistake, personally. However, cancellation
safety of the mutex API is independent of whether or not the mutex implements
One option for avoiding this is to never hold a mutex across an
However, you’ll have to check that rule in code review, rather than at compile
time, which means you’ll miss it eventually.
To restore that compile time check, a Tokio application might wrap the mutex
guard in a type that does not implement
Send. This will prevent an async
function from holding it across an
await point … in Tokio. This is because
Tokio requires that all futures implement
lilos, however, does not
have any such requirement, so this trick wasn’t an option for me.
I wound up deciding to do something slightly different. By default, the
Mutex type (in 0.3 and later) provides a
perform operation instead of a
perform takes a closure to execute while the lock is held.
You use it like this:
Critically, it’s a normal Rust closure – not an
async block. This means it’s
await during the
perform, so cancellation can’t occur. If you
await in the body of the closure, you’ll get a compile error.
This covers many use cases for mutexes, but not all. Sometimes you genuinely
need to hold a mutex while
await-ing some operation. (Using a mutex to share
access to something like an I2C peripheral, for instance.)
lilos 0.3 provides an imperfect but useful escape hatch for these situations:
it defines a wrapper called
CancelSafe is not magic; it’s
defined like this:
By wrapping your data in
CancelSafe, you are asserting to the mutex that you
will always keep it in a valid state at any possible cancellation point. The
compiler can’t check this (yet), so you will need to take care to use it
But, if you create a
Mutex<CancelSafe<T>> instead of a
operation will reappear. As before, the
lock operation itself is cancel-safe,
in that its future can be dropped without messing up your data structures –
but the futures you build on top of that can easily mess up your data
structures if they’re cancelled. By using
CancelSafe you’re indicating that
you’ve thought about this and are comfortable with the risk.
(The risk, incidentally, is not an
unsafe-style memory safety risk – not
unless you’re doing the data structure updates with
speaking, messing up a data structure like this results in a panic down the
road, not data corruption.)
Toward a more cancel-safe world
I expect these APIs will need revision, and I’m planning on adjusting them in
the next compatibility-breaking
lilos release (which’ll be 0.4 or, if I get
ambitious, 1.0). At the very least, some of my name choices in the current APIs
aren’t great, and
CancelSafe is probably the wrong name for the wrapper.
perform taking ownership of its closure is also questionable – it probably
wants to return a permit-style token like the queue does.
But, at this point, the non-deprecated API in
lilos is strictly cancel-safe
except for the
handoff module and the bit about
perform taking ownership of
its closure. The
handoff module is hard to fix; I can’t apply the same
technique I used on the queue, because
handoff doesn’t have any backing
storage. If I come up with something, I’ll definitely write it up!
As always, I’d welcome any feedback on these API designs (or their
implementation). If you’ve got ideas for how to improve the ergonomics around
cancel safety in
lilos, drop me a line.
Update from a few days later: I’ve revamped the
perform operation to not
consume the closure until it’s sure to succeed, and renamed the
lock_assuming_cancel_safe – a much more visible scary name, in keeping
with my broader theories about programmers preferring shorter operations over
longer ones by default.
I’ve also turned off the
handoff module in the crate’s default features. This
means that, as far as I know and barring any new bugs, the next major release of
lilos should be entirely strictly-cancel-safe by default.