Mutex without lock, Queue without push: cancel safety in lilos

I’m trying to do something kind of unusual with lilos: in addition to almost all of the APIs being safe in the Rust sense, I’m also attempting to create an entire system API that is cancel-safe. I’ve written a lot about Rust’s async feature and its notion of cancellation recently, such as my suggestion for reframing how we think about async/await.

My thoughts on this actually stem from my early work on lilos, where I started beating the drum of cancel-safety back in 2020. My notion of what it means to be cancel-safe has gotten more nuanced since then, and I’ve recently made the latest batch of changes to try to help applications built on lilos be more robust by default.

So, wanna nerd out about async API design and robustness? I know you do.

Levels of cancel safety

The concept of “cancel safety” is pretty recent, and (in my opinion) we don’t have great terms for talking about it. For my documentation, and for this post, I find it’s useful to distinguish between different levels of cancel safety in async APIs.

  1. Not cancel-safe. An API that generates a future that, if dropped before it resolves, will do something unpleasant or ill-defined, such as lose data, leave things in an inconsistent state, or panic. While panicking is not a safety issue from a memory safety perspective, it is a cancel-safety issue since it can cause a program that looks right to fail catastrophically later.

  2. Strictly cancel-safe. You can drop the future generated by a strictly cancel-safe API without ill effect (other than wasting some CPU time, perhaps). You can turn around and call the same operation again and get essentially the same result. This is the ideal, but can be hard to reach in some cases.

  3. Weakly cancel-safe. The cancellation behavior for a future is well-defined, designed to be useful, and documented, but falls short of the “just drop it and retry” property required for strict cancel-safety. Think of this category as the border zone between the previous two. I’m trying to avoid building any APIs in this category, but there’s still one function in lilos like this.

  4. Vacuously cancel-safe. An operation technically meets the criteria for strict cancel-safety in isolation, but encourages uses that make it trivial to accidentally write cancel-unsafe code on top of it. These APIs are bug generators!

In lilos, the goal is for every OS API to be strictly cancel-safe, which I’ve nearly achieved (there’s still one weakly cancel-safe operation in an optional module). More importantly (to me, anyway), the intent is not to provide any “vacuously cancel-safe” API.

This has required some careful thought and design, because – to my knowledge – other async platforms aren’t attempting to do this. tokio is full of non-cancel-safe API, for instance. In some cases it offers a cancel-safe option alongside a cancel-unsafe one; in other cases (such as the async I/O facility) there’s no cancel-safe option available.

So, let’s look at two of the cases I ran into in lilos.

What good is a queue that you can’t push into?

lilos provides a single-producer, single-consumer queue in the spsc module. Its API has evolved considerably over the past four years, generally as a result of me shooting myself in the foot with it. The current API is pretty decent.

The queue uses a pattern that’s pretty common in Rust APIs designed to be used without dynamic memory allocation: you create a queue, and then call a split() operation to generate two endpoints. One endpoint can push, the other can pop. Holding the push end of the queue only gives you the ability to put new things into it, and holding the pop end only lets you take things out.

Here’s a silly example using the lilos 0.2-era API (for reasons that will become clear in a minute):

let mut q = Queue::new(storage_for_the_queue);
let (q_in, q_out) = q.split();

// Put something in
q_in.push(42).await;
// ...and take it back out
assert_eq!(q_out.pop().await, 42);

That example is silly because the same function holds both the push and pop ends of the queue, but, you get the rough idea.

During the lilos 0.3 series, however, I deprecated the push operation. (It will be removed entirely for 0.4.) That might seem strange – what good is a queue that you can’t push into?

The reason I’m removing push is that the operation is inherently not cancel-safe:

  1. Calling q_in.push(element) produces a future that takes ownership of element.
  2. When that future resolves, element is stored safely in the queue.
  3. However, if you were to drop the future before it resolves, element is lost! (There’s a sketch of this below.)
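
To make that concrete, here’s a sketch of how the loss happens. with_timeout and Millis are stand-ins for whatever deadline facility your executor provides – I’m assuming a combinator that drops the losing future and returns None on timeout:

// Race the push against a deadline. If the queue stays full past
// the deadline, the combinator drops the push future -- and the
// future owns `precious`, so `precious` is dropped along with it.
if with_timeout(Millis(10), q_in.push(precious)).await.is_none() {
    // Too late: `precious` is already gone. We can't log it,
    // retry it, or return it to the caller.
}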

Instead, the input endpoint of the queue now has a reserve operation. You use reserve like this:

q_in.reserve().await.push(42);

Notice that there’s still a function here called push, but it now occurs after the await. That’s important. The future produced by reserve doesn’t take ownership of the element going into the queue, so you can drop it freely without any ill effect. When it resolves, it returns a type called Permit, which represents the right to push a single element into the queue.

You can choose to drop the Permit, which will just give up that right without changing the contents of the queue. Or, you can call the Permit::push operation, which moves ownership of the element into the queue, while simultaneously consuming the Permit so it can’t accidentally be used again.
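
With the same hypothetical with_timeout combinator as above, the reserve-based version of that code keeps the element on the timeout path (report_queue_full is a made-up stand-in):

match with_timeout(Millis(10), q_in.reserve()).await {
    // We hold a Permit; moving the element in is synchronous, so
    // there's no cancellation point between reserving and pushing.
    Some(permit) => permit.push(precious),
    // The reserve future was dropped, but it never owned the
    // element -- we still have `precious` and can do something
    // sensible with it.
    None => report_queue_full(precious),
}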

The reserve operation intrinsically meets the definition of strict cancel-safety, while the push operation intrinsically cannot. So, lilos will only offer reserve.

To Tokio’s credit, I borrowed the idea of a reserve operation from them. They have not deprecated their cancel-unsafe push operation yet, but perhaps they’re planning to.

What good is a mutex you can’t lock?

lilos also provides a mutex type, in the mutex module. It looks very much like std::sync::Mutex from the standard library. It contains the data that it guards (which all mutexes in Rust should do).

However, lilos::mutex::Mutex doesn’t have a lock operation by default.

The mutex in std has a lock operation that produces a MutexGuard, a “smart pointer” object that indicates that you’ve successfully locked the mutex, and grants access to the guarded data. You use the std mutex like this:

let mutex = std::sync::Mutex::new(some_data);

let guard = mutex.lock().unwrap(); // lock() returns a Result; it
                                   // fails if the mutex was poisoned
                                   // (more on poisoning below)

guard.stuff; // accesses the fields of the guarded data

// mutex is unlocked when guard is dropped

This is a very convenient design for std, where cancellation is not a thing.

Async mutex APIs tend to follow the same pattern – for instance, the mutex in lilos 0.2 and earlier, or the Tokio mutex. In those cases the code looks very similar, but with an await when you lock. Easy enough.

It’s straightforward to define an async mutex API that appears to meet the criteria for strict cancel-safety – here sketched with the Tokio mutex, though the lilos 0.2 mutex behaved the same way:
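
let mutex = tokio::sync::Mutex::new(some_data);

// Dropping this future before it resolves simply abandons our
// place in line: the mutex's state is untouched, and calling
// lock again later gives an equivalent result. By the definition
// above, that's strictly cancel-safe -- in isolation.
let guard = mutex.lock().await;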

It turns out this API design is “vacuously cancel-safe,” because of the way mutexes are typically used. It is very common for programs to temporarily violate the invariants of data stored in a mutex, only to “clean up” after themselves before unlocking the mutex. This is perfectly safe: because you’ve locked the mutex, you know that no other thread/task/etc. can see the changes you’re making, so it’s not critical to keep everything perfect at every step. You just have to make sure things are neat and tidy before you unlock. (This is very similar to the common pattern of temporarily violating invariants through a Rust &mut reference – since it’s an exclusive reference, you can do it without messing anyone up.)

All this falls apart if other code can cause you to suddenly unlock the mutex without intending to, before you’re done with your updates. And that’s the situation you wind up in with async cancellation. Consider this code:

// invariant: mutex always contains Some(resource), never None.
let mutex = tokio::sync::Mutex::new(Some(resource));

let mut g = mutex.lock().await;
// do stuff with the resource that requires taking it out of
// the mutex temporarily.
let resource = g.take().unwrap();
let new_resource = do_stuff_with(resource).await;
// put it back before unlocking.
*g = Some(new_resource);

The problem arises at the second await (when we call do_stuff_with). If we’re cancelled while parked at that await, we’ll unlock the mutex while its contents are invalid. The next time any code locks the mutex and calls take().unwrap(), it will panic – so the bug report will point at that innocent code, rather than at the code that actually caused the problem by not restoring the mutex contents before unlocking!

(You might note that this behavior is very similar to poisoning in the std mutex, which happens if code panics while holding a mutex. Poisoning exists specifically to catch this problem of a mutex getting unlocked while the data inside is in an invalid state. The tokio mutex doesn’t implement poisoning – it just unlocks on panic. This is a design point that reasonable people can differ on, but I consider it to be a mistake, personally. However, cancellation safety of the mutex API is independent of whether or not the mutex implements poisoning.)

One option for avoiding this is to never hold a mutex across an await point. However, you’ll have to check that rule in code review, rather than at compile time, which means you’ll miss it eventually.

To restore that compile-time check, a Tokio application might wrap the mutex guard in a type that does not implement Send. This will prevent an async function from holding it across an await point … in Tokio. That works because Tokio’s spawn requires the futures it’s given to implement Send. lilos, however, doesn’t have any such requirement, so this trick wasn’t an option for me.
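
Here’s a sketch of that trick – the wrapper type and its field names are mine, not Tokio’s:

use std::marker::PhantomData;

// Raw pointers are !Send, so the PhantomData field makes this
// struct !Send. Any future holding a NotSendGuard across an
// .await is then itself !Send, and tokio::spawn will reject it
// at compile time.
struct NotSendGuard<'a, T> {
    inner: tokio::sync::MutexGuard<'a, T>,
    _not_send: PhantomData<*const ()>,
}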

I wound up deciding to do something slightly different. By default, the lilos Mutex type (in 0.3 and later) provides a perform operation instead of a lock operation. perform takes a closure to execute while the lock is held. You use it like this:

mutex.perform(|data| data.squirrels += 1).await;

Critically, it’s a normal Rust closure – not an async block. That means it’s impossible to await inside perform, so there’s no cancellation point while the lock is held. If you attempt to await in the body of the closure, you’ll get a compile error.
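
For instance, something like this is rejected at compile time (some_async_op is a made-up stand-in):

mutex.perform(|data| {
    some_async_op(data).await // error[E0728]: `await` is only allowed
                              // inside `async` functions and blocks
}).await;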

This covers many use cases for mutexes, but not all. Sometimes you genuinely need to hold a mutex while await-ing some operation. (Using a mutex to share access to something like an I2C peripheral, for instance.)

lilos 0.3 provides an imperfect but useful escape hatch for these situations: it defines a wrapper called CancelSafe. CancelSafe is not magic; it’s defined like this:

pub struct CancelSafe<T>(pub T);

By wrapping your data in CancelSafe, you are asserting to the mutex that you will always keep it in a valid state at any possible cancellation point. The compiler can’t check this (yet), so you will need to take care to use it correctly.

But, if you create a Mutex<CancelSafe<T>> instead of a Mutex<T>, the lock operation will reappear. As before, the lock operation itself is cancel-safe, in that its future can be dropped without messing up your data structures – but the futures you build on top of that can easily mess up your data structures if they’re cancelled. By using CancelSafe you’re indicating that you’ve thought about this and are comfortable with the risk.
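
For the shared-I2C case above, that might look something like this. The bus API is made up, and I’m using the 0.3-era lock name (the update below notes it was later renamed):

// Holding the lock across these awaits is the whole point of
// sharing the bus -- and the bus handle has no invariant that a
// cancellation could leave violated, which is exactly the
// judgment CancelSafe asks us to certify.
let mut bus = shared_bus.lock().await;
bus.write(DEVICE_ADDR, &setup_bytes).await?;
bus.read(DEVICE_ADDR, &mut response).await?;
// Unlocked when `bus` goes out of scope.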

(The risk, incidentally, is not an unsafe-style memory safety risk – not unless you’re doing the data structure updates with unsafe. Generally speaking, messing up a data structure like this results in a panic down the road, not data corruption.)

Toward a more cancel-safe world

I expect these APIs will need revision, and I’m planning on adjusting them in the next compatibility-breaking lilos release (which’ll be 0.4 or, if I get ambitious, 1.0). At the very least, some of my name choices in the current APIs aren’t great, and CancelSafe is probably the wrong name for the wrapper. perform taking ownership of its closure is also questionable – it probably wants to return a permit-style token like the queue does.

But, at this point, the non-deprecated API in lilos is strictly cancel-safe except for the handoff module and the bit about perform taking ownership of its closure. The handoff module is hard to fix; I can’t apply the same technique I used on the queue, because handoff doesn’t have any backing storage. If I come up with something, I’ll definitely write it up!

As always, I’d welcome any feedback on these API designs (or their implementation). If you’ve got ideas for how to improve the ergonomics around cancel safety in lilos, drop me a line.

Update from a few days later: I’ve revamped the perform operation to not consume the closure until it’s sure to succeed, and renamed the lock operation to lock_assuming_cancel_safe – a much more visible scary name, in keeping with my broader theories about programmers reaching for the operation with the shorter name by default.

I’ve also turned off the handoff module in the crate’s default features. This means that, as far as I know and barring any new bugs, the next major release of lilos should be entirely strictly-cancel-safe by default.