Measure What You Optimize

Learn Rust the Dangerous Way, Part 3

(Series Overview)

In part 2 we introduced Rust references, and this was enough to convert one of our inner functions into safe Rust.

The others are still unsafe. There are several reasons for this. In this, the briefest of sections, we’ll tackle the easiest one: deliberate use of uninitialized memory.

Refreshing your (uninitialized) memory

At the end of part 2, our program looked like this:

There are two places in the original C program — and, thus, in our program — where the author deliberately left a local variable uninitialized, to be filled in later. While the choice isn’t explained within the program, I can make an educated guess as to why it was done this way: performance.

Often, the easiest way to improve the performance of your software is to make it do less. For example, imagine a program that is decompressing a gigabyte of data into RAM, using a single 1GiB allocation. The program could go through and zero the gigabyte before decompressing, but this is technically wasted effort: it’s going to overwrite every last one of those zeros with valid data.

Many safe or higher-level languages don’t give the programmer a choice: uninitialized memory is dangerous, so your memory is going to be initialized. Rust and C let the programmer decide, and this can be a useful tool for optimizing memory accesses. For instance, Vec from Rust’s standard library uses uninitialized memory under the hood when you ask it to pre-allocate space for elements without filling them in, among other times.

In C, uninitialized variables are easy to overlook, because you’re looking for something that isn’t there (the initializer). One simplified example from nbody:

double position_Delta[3];
for(intnative_t m=0; m<3; ++m)
    position_Delta[m]= initial_value_here;

When declaring a local, double position_delta[3]; declares an uninitialized array. The most convenient thing to type is also, usually, the wrong and dangerous one.

As we saw in part 1, Rust requires every variable to be initialized, and if you’d like it to not do that, you have to be very explicit:

let mut position_Delta = [mem::MaybeUninit::<__m128d>::uninit(); 3];
for m in 0..3 {
    position_Delta[m].as_mut_ptr().write(
        initial_value_here
    );
}
let position_Delta: [__m128d; 3] = mem::transmute(position_Delta);

That’s a lot of code, though, and it distracts from what we’re actually trying to do (initialize the array) with an irrelevant task (managing uninitialized memory). Compare that to initializing the array with zeros and immediately overwriting it:

let mut position_Delta = [0.; 3];
for m in 0..3 {
    position_Delta[m] = initial_value_here;
}

That’s much shorter and clearer! But taken at face value, it’s doing around 2x the work.

Is it, though? There’s one way to find out, and that’s to try it.

Measuring

If we replace all uses of uninitialized memory with the simpler equivalent, how do the results compare to nbody-2? I usually start by checking the sizes:

$ size nbody-2 nbody-3
   text	   data	    bss	    dec	    hex	filename
 226776	  10744	    888	 238408	  3a348	nbody-2
 226776	  10744	    888	 238408	  3a348	nbody-3

Um. So they’re exactly the same size. That’s often a hint that the compiler output didn’t change. Here’s an easy way to test:

$ diff <(objdump -d --demangle nbody-2) <(objdump -d --demangle nbody-3)
2c2
< nbody-2:     file format elf64-x86-64
---
> nbody-3:     file format elf64-x86-64

Only the name of the program has changed. (Since the filename doesn’t affect the performance of the program, the performance has also not changed.)

The compiler has noticed that we initialized the array and then overwrote it completely. The zeros stored to the array were dead stores, and the compiler applied dead store elimination to get rid of them. This is not unique to rustc: back-porting this change to the C program shows that gcc does the same thing.

For small and predictably-sized structures — the sort of thing you’d allocate on the stack — it’s almost never a performance win to use uninitialized memory. Dead store elimination is the sort of thing modern compilers can do while wearing a blindfold and riding a unicycle.

A safe version of output_Energy

Not only is the new version of the code simpler, it’s also safe. The code no longer relies on features that, if misused, can break memory safety — like uninitialized memory.

In fact, the use of uninitialized memory was the only unsafe bit remaining in output_Energy, which we can check by deleting the unsafe keyword from the top of the function and recompiling. If it compiles, we’re good.

Here’s the new safe version of output_Energy:

fn output_Energy(bodies: &mut [body; BODIES_COUNT]){
    let mut energy = 0.;
    for i in 0..BODIES_COUNT {

        // Add the kinetic energy for each body.
        energy += 0.5 * bodies[i].mass * (
            bodies[i].velocity[0] * bodies[i].velocity[0] +
            bodies[i].velocity[1] * bodies[i].velocity[1] +
            bodies[i].velocity[2] * bodies[i].velocity[2]);

        // Add the potential energy between this body and every other body.
        for j in i+1..BODIES_COUNT {
            let mut position_Delta = [0.; 3];
            for m in 0..3 {
                position_Delta[m] =
                    bodies[i].position[m] - bodies[j].position[m];
            }

            energy -= bodies[i].mass * bodies[j].mass / f64::sqrt(
              position_Delta[0] * position_Delta[0] +
              position_Delta[1] * position_Delta[1] +
              position_Delta[2] * position_Delta[2]);
        }
    }

    // Output the total energy of the system.
    println!("{:.9}", energy);
}

You might be wondering why we care. The program worked fine with the unsafe version of this function, and every C function is the equivalent of unsafe.

The answer comes back to local reasoning. Since the function is missing the unsafe modifier, and contains no unsafe blocks in its body, I don’t need to read the rest of the program to tell that this function can’t…

This means less debugging for us, and fewer possible security exploits for our users.

Here’s the entire program with the safe output_Energy in context:

Its performance matches the program from part 2, as you’d expect.

CommandMean [s]Min [s]Max [s]Ratio
./nbody.clang-8.bench 500000005.277 ± 0.0075.2585.2821.00x
./nbody-1.bench 500000005.123 ± 0.0245.0955.1610.97x
./nbody-2.bench 500000005.101 ± 0.0055.0935.1070.97x
./nbody-3.bench 500000005.103 ± 0.0025.1005.1050.97x

What have we learned?

Optimizations usually add complexity. The complexity is often worthwhile. But if the “optimization” doesn’t improve performance, then we’ve just added complexity for no good reason. In the case of optimizations that do subtle things with memory, there’s also a real risk that the added complexity introduces crashing bugs and security flaws.

Unless you’re writing assembly language, your program is not a list of literal instructions that a computer will obey. It will be analyzed, massaged, optimized, scheduled, etc. by an intermediary first — in the case of Rust and C, that’s the compiler.

This is why, if you have demanding performance requirements, it’s critical to check that your optimizations have done what you expect, by looking — directly or indirectly — at the compiler’s output.

In this case, a reasonable-looking optimization applied in the C program turns out to have been unnecessary (in both C and Rust). Because the optimization required unsafe features, and safe code makes us think and type less, Rust gave us an incentive to double- and triple-check this optimization. Without this incentive, we might not have tried.

Several times, I’ve applied “optimizations” in Rust programs that used unsafe operations in clever ways, only to find out that the equivalent, “naïve” safe code is just as fast or even faster. While I may be clever, the compiler is also clever, and the performance of my program is the result of a dialog between the two of us.

Where next?

I’m going to keep sanding down the unsafe parts of this program. We’ve covered the simple cases; the uninitialized variables in this section were the last bits that can be trivially made safe by massaging the code. Now we have to turn our focus to the advance function.

The optimizations used in advance require unsafe code. Instead of eliminating it, so that the compiler can prove our code safe, we must switch to a more subtle technique: corralling it into small sections, so that we, the programmer, can verify it as safe with local reasoning.

This is the topic of part 4.