Making Safe Things From Unsafe Parts

Learn Rust the Dangerous Way, Part 5

2019-12-14

In part 4 we took the unsafe code that deals with treating arrays of f64 as arrays of vectors, and we corralled it into a safe API.

In this installment, we’ll look at the remaining reasons why advance is an unsafe fn, and make it safe — not by removing all the unsafe, but by narrowing it down.

This one’s a doozy — the remaining changes to advance are hard to separate, so I’ve packed them all into one section. Now is probably a good time to refill your coffee.

The state of `advance`

When we left off last time, our program looked like this:

nbody-4.rs

Like in the original C program, advance is a long function — 149 lines long, to be exact — so I won’t reproduce the whole thing here. Instead, we’ll look at highlights.

advance is still an unsafe fn. This normally means that a function can violate memory safety if used incorrectly — it’s a warning to callers that they need to ensure that some conditions hold, or the program may do totally undefined things.

Is advance actually unsafe to call? Sort of, but that doesn’t matter just yet. If you were to delete the unsafe modifier and try to compile, you’d get a bunch of errors (27 of them). This is because an unsafe fn also gains the ability to call other unsafe operations without being explicit about it, and we’re still doing that. What unsafe operations is advance relying on? If you were to study those compile errors, you’d see two classes of operations, which I’ll discuss in the next two sections.
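To make the difference between an unsafe fn and an unsafe block concrete, here is a minimal sketch that is separate from the n-body program. The names `read_first` and `caller` are made up for illustration. Inside the unsafe fn, the raw pointer dereference needs no extra marking (the 2024 edition warns about this by default, but still accepts it), while the safe caller has to opt in with an explicit block.

```rust
/// Hypothetical example, not taken from nbody-4.rs.
///
/// Safety: `ptr` must point to a valid, initialized f64.
unsafe fn read_first(ptr: *const f64) -> f64 {
    // Dereferencing a raw pointer is an unsafe operation, but the body of
    // an `unsafe fn` is allowed to perform it without an `unsafe` block.
    *ptr
}

fn caller(values: &[f64; 3]) -> f64 {
    // Safety: the pointer comes from a live reference to a non-empty array.
    unsafe { read_first(values.as_ptr()) }
}

fn main() {
    println!("{}", caller(&[1.5, 2.5, 3.5]));
}
```

Deleting the unsafe keyword from `read_first` would turn the dereference into a compile error, which is the same kind of error you get from deleting it on advance.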

Mutable statics

advance contains two static mut variables, declared (as of the end of last chapter) as follows:

static mut position_Deltas: [Interactions; 3] =
    [Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] }; 3];
static mut magnitudes: Interactions =
    Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] };

These two variables, position_Deltas and magnitudes, are used like locals within advance, but are declared as static mut.

Why is this unsafe? Basically, because it causes the function to not be reentrant. By storing some local data in a global variable, each call to the function depends on the previous one, and two concurrent calls to the function — perhaps from separate threads — would corrupt one another’s data and produce incorrect results.

Rust really doesn’t like this sort of non-reentrant function, because Rust assumes that you might want to use threads someday, and thread safety guarantees are part of Rust’s core values. But this sort of thing is equally dangerous in C, where threads are becoming more common. (See, for example, strtok vs. strtok_r.)
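Here is a small sketch of the same pattern outside the n-body program. It assumes a made-up function, `sum_of_squares`, that keeps its working memory in a static mut called `SCRATCH`. Neither name comes from the original code.

```rust
// Scratch space shared by every call, in the spirit of the statics in `advance`.
static mut SCRATCH: [f64; 4] = [0.0; 4];

/// Hypothetical function, not from the n-body program.
///
/// Safety: no two calls may overlap (no threads, no reentrancy), because
/// every call reads and writes the same global buffer.
unsafe fn sum_of_squares(input: &[f64; 4]) -> f64 {
    // First pass: fill the shared scratch buffer.
    for i in 0..4 {
        SCRATCH[i] = input[i] * input[i];
    }
    // Second pass: read it back. A concurrent call could have overwritten
    // the buffer between the two loops.
    let mut total = 0.0;
    for i in 0..4 {
        total += SCRATCH[i];
    }
    total
}

fn main() {
    // Safety: this program is single-threaded and the calls do not overlap.
    let total = unsafe { sum_of_squares(&[1.0, 2.0, 3.0, 4.0]) };
    println!("total = {}", total);

    // The squares are still sitting in SCRATCH after the call has returned.
    let leftover = unsafe { SCRATCH[3] };
    println!("leftover = {}", leftover);
}
```

The Safety comment is carrying all the weight here: nothing stops a second thread from calling the function at the same time, and the compiler cannot check that promise for us.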

So if it’s dangerous, why does the program do it?

The original program didn’t document the intent behind using static, but I’m pretty confident that it’s an optimization to reduce the cost of calling advance.

These variables aren’t tiny — each Interactions contains 10 f64s, so position_Deltas contains 30, for 240 bytes. If these were declared as locals, the function would need to allocate a large stack frame to contain them, and initialize them on each call. advance is called a lot, and that cost would add up. Experimentally, the naive approach to switching these variables to locals — replacing static with let — costs about 5%!

(Notice that this optimization is doing essentially the same thing as the use of uninitialized memory we saw in part 3: it’s trying to avoid initializing a chunk of memory on each call if it doesn’t need to. In part 3, that optimization proved unnecessary; in this case, the optimization works, probably because the initialization loop is more complex, confusing dead store elimination. Remember to measure the effect of your optimizations!)

We’ll address this unsafety by replacing this optimization with an equivalent, safe one, later in this section. But first, there’s another unsafe to consider:

SIMD intrinsics

All the SIMD intrinsics in the standard library are unsafe.

That might seem unnecessary and arbitrary — after all, most safety issues we’ve considered involve memory, and a function like _mm_add_pd just adds numbers without touching memory. So why is it unsafe?

At the risk of being snarky: blame Intel.

We’re using SSE2 instructions in this program. What happens if you run SSE2 code on a processor that doesn’t implement SSE2? One would hope that, when the processor first hits an unsupported instruction, it would trigger some sort of fault and reject the program.

Nope! Many SSE/SSE2 instructions encode other actions on earlier processors. They’re undocumented, but they work. So what looks like a vector add to us might look like a jump or a store to memory to an earlier CPU.

And so, since an unsupported SIMD instruction can destroy the program’s environment in unpredictable ways, using them explicitly like we’re doing is unsafe. (The current Rust program, like the original C program, will do unpredictable things on an older CPU.)

It’s entirely possible to use SIMD in safe Rust — rustc is surprisingly good at auto-vectorizing loops, there are convenient third-party libraries, and you can always check the target processor model at compile time. We’ll take that last option, later in this section.
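As a rough illustration of the first option, this is the kind of loop that rustc can often vectorize on its own. The helper `add_in_place` is hypothetical, and whether the compiler actually emits SIMD instructions for it depends on the optimization level and the target CPU, so treat it as a sketch rather than a guarantee.

```rust
/// Hypothetical helper: adds `delta` to `acc` element by element.
fn add_in_place(acc: &mut [f64], delta: &[f64]) {
    // Iterating with `zip` avoids per-element index bounds checks, which
    // leaves the optimizer free to process several elements at once.
    // `zip` stops at the end of the shorter slice.
    for (a, d) in acc.iter_mut().zip(delta) {
        *a += *d;
    }
}

fn main() {
    let mut acc = [1.0, 2.0, 3.0, 4.0];
    add_in_place(&mut acc, &[0.5, 0.5, 0.5, 0.5]);
    println!("{:?}", acc);
}
```

There is no unsafe anywhere in this version; the price is that we no longer control exactly which instructions come out.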

Getting to safety

Getting the static out

To recap: advance declares two working arrays as static to avoid paying the cost of initializing them on each call. This optimization works, and saves around 5%, but it renders the function non-reentrant and unsafe.

These arrays also carry data between invocations of the function. This is purely a side effect; advance overwrites the arrays every time, so the data from previous invocations doesn’t get used, but it still gets stored after advance returns.

This observation is key to making the trick safe. What other ways could we declare the arrays so that advance doesn’t have to stack-allocate them on every call, and where data may flow between successive calls to advance?

How about as locals in advance’s caller?

Allocating the arrays as locals in main, and passing them by reference to advance, achieves the same optimization goals without using any unsafe code at all.

The signature of advance changes to allow the caller to pass in references:

unsafe fn advance(
    bodies: &mut [body; BODIES_COUNT],
    position_Deltas: &mut [Interactions; 3],   // New!
    magnitudes: &mut Interactions,             // Also new!
){

…and then we delete the static declarations.

We’ve had Interactions declared inside the body of advance ever since we introduced it in part 4; if we’re going to expose it to the caller, we need to move it, its impl, and the two consts it depends on a few lines up, outside advance. (Copy-paste will suffice.)

Finally, the caller, main, needs to allocate the two variables and pass them in by reference. main now looks like this:

fn main() {

    // These are new:
    let mut position_Deltas: [Interactions; 3] =
        [Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] }; 3];
    let mut magnitudes: Interactions =
        Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] };

    unsafe {
        offset_Momentum(&mut solar_Bodies);
        output_Energy(&mut solar_Bodies);
        let c = std::env::args().nth(1).unwrap().parse().unwrap();

        for _ in 0..c {
            // Call to advance has changed:
            advance(
                &mut solar_Bodies,
                &mut position_Deltas,
                &mut magnitudes,
            )
        }

        output_Energy(&mut solar_Bodies);
    }
}

The complete program after this change:

nbody-5a.rs
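If you want to play with the pattern in isolation, here is a self-contained sketch. `Scratch` is a made-up stand-in for Interactions and `sum_of_squares` is a made-up stand-in for advance. The point is only that the working memory is allocated once by the caller and handed in by mutable reference.

```rust
/// Hypothetical scratch buffer standing in for `Interactions`.
#[derive(Clone, Copy)]
struct Scratch {
    values: [f64; 4],
}

/// Safe and reentrant: all working memory arrives through `scratch`.
fn sum_of_squares(input: &[f64; 4], scratch: &mut Scratch) -> f64 {
    for (slot, x) in scratch.values.iter_mut().zip(input) {
        *slot = x * x;
    }
    scratch.values.iter().sum()
}

fn main() {
    // Allocated and zeroed once, then reused by every call in the loop.
    let mut scratch = Scratch { values: [0.0; 4] };
    let mut total = 0.0;
    for step in 0..3 {
        let input = [step as f64; 4];
        total += sum_of_squares(&input, &mut scratch);
    }
    println!("total = {}", total);
}
```

Two threads could each own a `Scratch` and call the function at the same time without interfering, and the borrow checker rejects any attempt to share one buffer mutably between them.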

Safer SIMD

As we saw above, the safety issue with SIMD hinges on the behavior of older processors when executing newly added instructions.

rustc can target a wide variety of processors. When compiling, you can either target a generic “least common denominator” processor, or a specific model, or processors with a particular set of features. Currently, when I’ve built the binaries, I’ve targeted the CPU model native, which is a placeholder meaning “the kind of CPU I’m compiling on.” In my case, native actually means skylake; on your computer, it may represent something else.

If you compile a binary specifically targeting a recent CPU (such as skylake from 2015-ish), and then run it on an older CPU (such as a 2003-vintage AMD Opteron), Bad Things can happen, because the x86 instruction set encoding is simply not backwards-compatible¹. There’s not a lot we can do to defend against this mistake; while we could query the processor’s features at runtime, it’s possible that the code leading up to that check was compiled using new instructions that will execute incorrectly on the older machine.

¹ Lest I appear to be specifically criticizing Intel here, this is a problem on basically every processor architecture. ARM does only slightly better. Only RISC-V went out of its way to ensure that future extensions can be recognized and rejected by older processors.

On the other hand, if you compile a binary targeting an older machine, it should run on that older machine; if it can’t, it should fail to compile. That isn’t guaranteed today: Rust will let your code call _mm_add_pd (which only makes sense on processors that support SSE2) even if your target CPU doesn’t appear to support it, because Rust assumes that you’ve done some sort of CPUID check and you know what you’re doing. This is part of the unsafe contract for that operation, a contract that we’re currently fulfilling by sheer luck.

If we can ensure that our code will only compile when targeting processors with SSE2, then we can guarantee that the SIMD operations we’re using are safe. We’ll do that the same way you’d probably do it in C: conditional compilation.

Rust’s conditional compilation mechanism is different from C’s. Rather than a separate macro language evaluated by a preprocessor step, Rust’s is integrated into the language as attributes. By attaching a #[cfg(...)] attribute to something, you make the presence of that thing conditional on certain compile time configuration.

In our case, #[cfg(target_feature = "sse2")] makes something conditional on the compile-time target CPU supporting SSE2.

I’ll use a slightly blunt approach and slap that attribute right on advance, like so:

#[cfg(target_feature = "sse2")]  // <-- only compile for SSE2
unsafe fn advance(
    bodies: &mut [body; BODIES_COUNT],
    position_Deltas: &mut [Interactions; 3],
    magnitudes: &mut Interactions,
){

The reason I’m calling this “blunt” is the way that it fails if we target a CPU without SSE2². The easiest way to see the failure is to disable the sse2 feature on our compiler command line (next to last line of the command):

$ rustc --target=x86_64-unknown-linux-musl \
        -C opt-level=3 -C debuginfo=2 -C codegen-units=1 \
        -C target-cpu=native \
        -C target-feature=-sse2 \
        nbody-5b.rs -o nbody-5b

error[E0425]: cannot find function `advance` in this scope
   --> 5/nbody-5b.rs:294:13
    |
294 |             advance(
    |             ^^^^^^^ not found in this scope

error: aborting due to previous error

For more information about this error, try `rustc --explain E0425`.

The way I’ve expressed the #[cfg] attribute, if the target processor doesn’t support SSE2, the advance function just vanishes — so any code that tries to call it now fails to compile. This is sufficient, but the error message is confusing, and leaves it to the reader to deduce that advance is missing because their processor is too old. In a Real Program I’d handle this slightly differently. See the core::arch module docs for different approaches, including checking for features at runtime, and enabling features in some places but not others.
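As a sketch of what those alternatives might look like (this is my own illustration, not code from nbody-5b.rs), the first item below turns the missing-feature case into a readable build error, and the hypothetical helper `cpu_has_sse2` shows the runtime check provided by the standard library's `is_x86_feature_detected!` macro. The error message text is made up.

```rust
// Fail the build with an explicit message instead of "cannot find function".
#[cfg(all(target_arch = "x86_64", not(target_feature = "sse2")))]
compile_error!("this program requires SSE2; build with -C target-feature=+sse2");

/// Hypothetical helper: asks the running CPU whether it supports SSE2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_has_sse2() -> bool {
    std::is_x86_feature_detected!("sse2")
}

/// On other architectures there is no SSE2 to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_has_sse2() -> bool {
    false
}

fn main() {
    if cpu_has_sse2() {
        println!("SSE2 detected at runtime");
    } else {
        println!("no SSE2; a real program would pick a scalar fallback here");
    }
}
```

The runtime check only helps when the binary as a whole targets a baseline CPU and just a few selected functions use newer instructions. It cannot rescue a program that was compiled wholesale for a newer processor.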

² If you’ve done a lot of SIMD programming on Intel, you may be asking, “but wait, are there any x86-64 CPUs that don’t support SSE2?” In fact, no, I don’t believe there are, so this is just me being paranoid. But if you extended the algorithm to use SSE4 features, this would become very important.

Because this is literally a one-line change, I haven’t included the full program source after the change. I trust you can figure it out if you need to. Don’t worry, there’ll be full source code after the next change.

Turning `unsafe` inside-out

After we got rid of the static mut variables, the only obligation the caller had to meet to use advance safely was to ensure that SSE2 instructions weren’t going to do dumb things on their processor. Now that advance only exists when targeting a CPU with SSE2, it is no longer unsafe to call it. This means that we can remove the unsafe keyword from fn advance, and switch to using more granular unsafe blocks inside the function.

For instance, we can explicitly mark the unsafe bit in the first snippet below (which appears about 2/3 of the way through advance), producing the second:

Before:

magnitudes.as_vectors()[i] = _mm_mul_pd(
    _mm_div_pd(_mm_set1_pd(0.01), distance_Squared),
    distance_Reciprocal,
);

After:

// Safety: this code is only compiled for processors that
// support SSE2, so the SIMD operations used here are
// safe.
magnitudes.as_vectors()[i] = unsafe { // <-- note
    _mm_mul_pd(
        _mm_div_pd(_mm_set1_pd(0.01), distance_Squared),
        distance_Reciprocal,
    )
};

(Remember that, in Rust, blocks like unsafe — or even if — can appear where values are expected, on the right-hand side of an assignment.)

Okay, but why is this better? It’s a fair question. There’s still unsafe stuff happening. Code in this unsafe block could be messing with raw pointers, overwriting memory, smashing the stack.

But it isn’t. The unsafe block is small, and contains only calls to SIMD operations and references to local variables. It only takes a few seconds to read the block and see that it’s only using SIMD.

Like when we wrapped unsafe union operations in safe API in part 4, moving to small, focused unsafe blocks makes it much easier to inspect the code and convince ourselves that it’s correct. In particular, by converting the remaining SIMD snippets into granular unsafe blocks, we don’t have to carefully inspect the rest of advance for unsafe operations — we just have to search for unsafe in an editor and read the snippets we find.

By going through advance somewhat mechanically, and wrapping small unsafe blocks around each SIMD expression, we can get advance to compile without itself being marked as an unsafe fn. Here’s the source code at this point:

nbody-5b.rs

(I would prefer to pull these unsafe blocks out of advance entirely and provide a safe SIMD API, like we did for the union in part 4. I’m not doing this right now because it would be a fairly disruptive change in a section that is already quite long, but stay tuned for a future installment.)
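To give a flavor of what such an API could look like, here is a minimal sketch. It is an assumption on my part, not the design used later in the series: a hypothetical function `add_pairs` hides the intrinsics behind a safe signature, and a scalar fallback keeps the sketch building on targets without SSE2.

```rust
// Fast path: only compiled when the target is x86-64 with SSE2 enabled.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn add_pairs(a: [f64; 2], b: [f64; 2]) -> [f64; 2] {
    use std::arch::x86_64::{_mm_add_pd, _mm_loadu_pd, _mm_storeu_pd};

    let mut out = [0.0; 2];
    // Safety: SSE2 is guaranteed by the cfg above, each pointer refers to a
    // live array of exactly two f64s, and the unaligned load/store
    // intrinsics have no alignment requirement.
    unsafe {
        let sum = _mm_add_pd(_mm_loadu_pd(a.as_ptr()), _mm_loadu_pd(b.as_ptr()));
        _mm_storeu_pd(out.as_mut_ptr(), sum);
    }
    out
}

// Portable fallback so the sketch still builds on other targets.
#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
fn add_pairs(a: [f64; 2], b: [f64; 2]) -> [f64; 2] {
    [a[0] + b[0], a[1] + b[1]]
}

fn main() {
    println!("{:?}", add_pairs([1.0, 2.0], [10.0, 20.0]));
}
```

Callers never write unsafe themselves; the one block that needs auditing lives inside the wrapper, next to the cfg that justifies it.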

Removing the `unsafe` block from `main`

We’re in the home stretch!

Currently, main contains a big unsafe block. It had to, because the functions it was calling were all originally unsafe. But now all our functions are safe, and so it seems like high time to remove that unsafe block:

Before:

fn main() {
    let mut position_Deltas: [Interactions; 3] =
        [Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] }; 3];
    let mut magnitudes: Interactions =
        Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] };

    unsafe {
        offset_Momentum(&mut solar_Bodies);
        output_Energy(&mut solar_Bodies);
        let c = std::env::args().nth(1).unwrap().parse().unwrap();
        for _ in 0..c {
            advance(&mut solar_Bodies, &mut position_Deltas, &mut magnitudes)
        }
        output_Energy(&mut solar_Bodies);
    }
}

After…?

fn main() {
    let mut position_Deltas: [Interactions; 3] =
        [Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] }; 3];
    let mut magnitudes: Interactions =
        Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] };

    // no unsafe here!
    offset_Momentum(&mut solar_Bodies);
    output_Energy(&mut solar_Bodies);
    let c = std::env::args().nth(1).unwrap().parse().unwrap();
    for _ in 0..c {
        advance(&mut solar_Bodies, &mut position_Deltas, &mut magnitudes)
    }
    output_Energy(&mut solar_Bodies);

}

But, surprise! It won’t compile. And the reason is familiar: the compiler points out that we’re messing around with a static mut, which implies thread safety and reentrancy problems. (This time, the static mut is the global solar_Bodies array.)

“But we’re in main!” you might protest. “main doesn’t need to be reentrant.”

Rust does not agree. main is not special in Rust — it’s just a function, and it needs to abide by the same rules as any other function. For instance, though it would be kind of odd, you could totally call main from another function — which would be reentering it, since all functions are called from main. (This is all true in C, as well.)

Fortunately, this one’s pretty easy to fix. We’re using the static mut variable solar_Bodies to hold the state of the simulation, initialized from some starting conditions. We can change the starting conditions into a const…

Before:

static mut solar_Bodies: [body; BODIES_COUNT]=[
    body {    // Sun
    // ... and so on

After:

const STARTING_STATE: [body; BODIES_COUNT]=[
    body {    // Sun
    // ... and so on

…and then switch to holding our running simulation state in a local, instead of a static:

Before:

fn main() {


    let mut position_Deltas: [Interactions; 3] =
        [Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] }; 3];
    let mut magnitudes: Interactions =
        Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] };

    unsafe {
        offset_Momentum(&mut solar_Bodies);
        output_Energy(&mut solar_Bodies);
        let c = std::env::args().nth(1).unwrap().parse().unwrap();
        for _ in 0..c {
            advance(&mut solar_Bodies, &mut position_Deltas, &mut magnitudes)
        }
        output_Energy(&mut solar_Bodies);
    }
}

After:

fn main() {
    let mut solar_Bodies = STARTING_STATE; // <-- new!

    let mut position_Deltas: [Interactions; 3] =
        [Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] }; 3];
    let mut magnitudes: Interactions =
        Interactions { scalars: [0.; ROUNDED_INTERACTIONS_COUNT] };

    // no unsafe here!
    offset_Momentum(&mut solar_Bodies);
    output_Energy(&mut solar_Bodies);
    let c = std::env::args().nth(1).unwrap().parse().unwrap();
    for _ in 0..c {
        advance(&mut solar_Bodies, &mut position_Deltas, &mut magnitudes)
    }
    output_Energy(&mut solar_Bodies);

}

And now main is entirely safe.

The program with these changes:

nbody-5.rs
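The const-to-local move is easy to try in miniature. In this sketch, `Body`, `step`, and `run` are hypothetical stand-ins with made-up fields and values. Each call to `run` copies the const into its own local, so calls cannot see each other's state.

```rust
/// Hypothetical, much-simplified body with one position and one velocity.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Body {
    position: f64,
    velocity: f64,
}

// Starting conditions live in a const, so there is no global mutable state.
const STARTING_STATE: [Body; 2] = [
    Body { position: 0.0, velocity: 1.0 },
    Body { position: 1.0, velocity: -2.0 },
];

/// Moves every body forward by `dt` using its current velocity.
fn step(bodies: &mut [Body; 2], dt: f64) {
    for body in bodies.iter_mut() {
        body.position += body.velocity * dt;
    }
}

/// Runs a fresh simulation; safe to call from anywhere, any number of times.
fn run(steps: u32) -> [Body; 2] {
    let mut bodies = STARTING_STATE; // a private copy for this call
    for _ in 0..steps {
        step(&mut bodies, 0.5);
    }
    bodies
}

fn main() {
    println!("{:?}", run(2));
    // The const itself is untouched by any number of runs.
    println!("{:?}", STARTING_STATE);
}
```

Because every run starts from its own copy, calling `run` twice gives the same answer both times, which is exactly the reentrancy property that a static mut could not promise.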

Evaluation

This has been quite a sprint! What have we accomplished?

advance is now safe to call.
The program will (correctly) fail to compile when targeting a CPU without SSE2.
The remaining unsafe bits inside advance are now granular and easy to inspect.
main is now safe and reentrant.

What effect does it have on the program as a whole? I’ve compiled a version at each step in this section (5a, 5b, and finally 5). First, let’s check their sizes:

$ size nbody-{4,5a,5b,5}
   text	   data	    bss	    dec	    hex	filename
 265608	  10332	   7432	 283372	  452ec	nbody-4.bench
 265808	  10332	   7112	 283252	  45274	nbody-5a.bench
 265808	  10332	   7112	 283252	  45274	nbody-5b.bench
 265888	  10044	   7112	 283044	  451a4	nbody-5.bench

Removing the static mut arrays inside advance reduced the bss RAM usage, which makes sense, as bss measures permanently dedicated (static) sections of RAM. This isn’t a real reduction in RAM usage, because the arrays are simply moved to the stack, which isn’t accounted for here.

The text section grew by 80 bytes at the point where we started initializing solar_Bodies from a const, instead of statically.

So, you win some, you lose some; overall, the binary is 328 bytes smaller.

More importantly, how’s the performance? Let’s measure at each step:

Command	Mean [s]	Min [s]	Max [s]	Ratio
`./nbody.clang-8.bench 50000000`	5.277 ± 0.007	5.258	5.282	1.00x
`./nbody-1.bench 50000000`	5.123 ± 0.024	5.095	5.161	0.97x
`./nbody-2.bench 50000000`	5.101 ± 0.005	5.093	5.107	0.97x
`./nbody-3.bench 50000000`	5.103 ± 0.002	5.100	5.105	0.97x
`./nbody-4.bench 50000000`	5.104 ± 0.002	5.101	5.107	0.97x
`./nbody-5a.bench 50000000`	5.145 ± 0.003	5.140	5.150	0.97x
`./nbody-5b.bench 50000000`	5.147 ± 0.002	5.145	5.149	0.98x
`./nbody-5.bench 50000000`	5.121 ± 0.003	5.118	5.127	0.97x

In other words,

Removing the optimization of using static tables in advance, and switching to locals allocated in main, had a cost (nbody-5a): runtime increased by 0.8% (41ms) over the code from part 4³.
Rearranging unsafe code (nbody-5b) had a negligible effect, as we’d expect.
Switching from using an unsafe static mut for the solar system state, to a local (nbody-5), won us back most of the performance we lost.

The final “pretty safe” version of the program is slightly faster than the original Rust translation, and significantly (3%) faster than the C original.

And it’s 0.3% (about 17ms) slower than nbody-4. If that 17ms is important to you, you could always stop with nbody-4 and skip the changes made in this section. One of the nice things about Rust is that, while Rust encourages you to make programs safe, it’s not required. nbody-4 is a pretty reasonable place to leave things, if nbody-5 doesn’t meet your performance needs.
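The differences quoted in this section come straight from the Mean column of the table. As a worked check, this small sketch recomputes them relative to nbody-4. The function name `slowdown` is made up; the timings are copied from the table.

```rust
/// Returns (extra milliseconds, percent slowdown) of `new_s` relative to `base_s`.
fn slowdown(base_s: f64, new_s: f64) -> (f64, f64) {
    let delta = new_s - base_s;
    (delta * 1000.0, delta / base_s * 100.0)
}

fn main() {
    // Mean runtimes in seconds, taken from the benchmark table.
    let nbody_4 = 5.104;
    let later_steps = [("nbody-5a", 5.145), ("nbody-5b", 5.147), ("nbody-5", 5.121)];

    for (name, mean) in later_steps {
        let (ms, pct) = slowdown(nbody_4, mean);
        println!("{}: +{:.0} ms ({:.1}%)", name, ms, pct);
    }
}
```

The differences are a few tens of milliseconds on a run of about five seconds, which is why it matters that the reported deviations are only a few milliseconds.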

³ Why are locals more costly than statics if we’re not paying to initialize them? It appears to come down to code density and addressing modes on x86-64. Because the address of a static is known during build (at link time), rustc can emit instructions that directly reference it with embedded absolute addresses. With a local, we have to compute its address relative to the stack pointer (the %rsp register) to reference it. The latter approach appears to produce less-dense code in this case.

This isn’t an absolute truth (pun intended): locals are usually faster on ARM, where instructions can’t embed an absolute memory address cheaply. Measure, measure, measure.

Wrapping up

We’ve reached our destination! Let’s look back at what we’ve done in the series so far.

Starting with a C program, we’ve produced an equivalent Rust program, following an incremental process without sweeping architectural changes. Comparing the two,

The Rust code doesn’t use any pointer arithmetic or unchecked array indexing, which eliminates bugs like buffer overruns, stack smashes, and the like.
The Rust code doesn’t use raw pointers at all, relying on Rust references to eliminate the possibility of NULL pointer bugs or accesses to uninitialized memory.
While we don’t use threads, the Rust code is fully-reentrant and thread safe, so we could operate in a threaded environment if needed.
The Rust program is slightly faster than the C program compiled by Clang (by about 3%), and significantly faster than GCC’s output (by 17%).
The Rust source code is longer, mostly due to the need to call Intel SIMD intrinsics directly, while the C program relies on non-standard operator overloading.
When both programs are statically linked⁴, the Rust binary is smaller — 40% the size of the GCC/Clang output.
Compile times are roughly equivalent (GCC is slightly faster, Clang, slightly slower).

While we’ve finished the task we originally set out to accomplish, I’m not quite done yet. The current Rust code is a close equivalent to the C code, but it’s very weird-looking Rust code. In the next (bonus!) section, I’ll look at how the program would differ if it were written in normal, idiomatic Rust.

⁴ I’m not playing games by statically linking; the author of the C program was doing it before I got here. That being said, a Rust program on Linux using dynamic linking will tend to be larger than a C program doing the same thing, simply because C assumes that a compatible version of its standard library is already installed on the system, and Rust does not, so std is always linked in. This can make a big difference for small binaries. (I point this out so it doesn’t seem like I’m glossing over Rust binary size issues, which can be a problem in some use cases.)
