
The C Production Model

C is a language dating from the early 1970s. You may have heard of it.

C is relatively easy to compile, but surprisingly difficult to build:

  • It has no module system.

  • Its traditional approach to modularity uses textual inclusion and arbitrary preprocessing of source code.

  • Its linkers are crazy simple, in a way that moves the complexity of linking large programs into the build system.

This page talks about the model Cobble uses to turn C program descriptions into binaries. The focus here is on GCC, but the concepts should translate to most other toolchains.

This is not a discussion of how to use Cobble to build your C program. It’s mostly of interest to people who write build systems or want to write their own production models.

Making Object Code from Source Files

In the beginning, there’s the C source file. Convention dictates that each source file in a C program can be compiled individually, a notion called separate compilation. (The alternative is to mash all your C files into one big file and then compile that. Cobble does not support this out of the box.)

When we compile a C source file, we get an object file out the other end. We can do this using a command line like the following:

gcc (some flags) -c source.c -o object.o

The flags may control include paths, optimization levels, CPU architecture, and the like.

The output may be affected by:

  • Things on the filesystem: the source file and files (transitively) included.
  • The choice of which compiler binary to run.
  • The flags.

That’s it — C compilers don’t (for example) read the process environment variables and make those available during compilation. (Thank goodness.)

There are two different aspects to the “things on the filesystem” part:

  • Which files are involved, and
  • What those files contain.

In C, these are intertwined: any file can #include another file. In fact, all these aspects are intertwined. The files that get #included may be different depending on the flags. Different compilers ship with different default include paths, and different default flags.

Tracking filesystem accesses and changes to files is Ninja’s job. Cobble is concerned with the other two parts. The environment keys that control object file production are:

  • cc: the name of the C compiler binary, preferably absolute, but possibly relative to the user’s PATH.

  • c_flags: the flags, as a list of strings. These will appear (separated by spaces) in the command line.

  • c_depswitch: a special setting, generated inside the target, that controls the scope of included-file change tracking used by the compiler. This is governed by the c_deps_include_system environment key and can’t be set directly.

  • __order_only__: a general Cobble key used to track dependencies on generated headers.

When producing a C object, we minimize the environment to those four things.

Notice that the environment doesn’t contain the path to the source file. Cobble encodes this separately in the name of the object file, e.g.

env/12345/foo/bar/baz.c.o

…so it doesn’t need to be in the hash.
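As a rough illustration of the idea, here is a minimal Python sketch of minimizing an environment, hashing it, and deriving an object path. It is not Cobble's actual code: the helper names (minimize, env_hash, object_path), the JSON-plus-SHA-1 hashing scheme, and the example values (including -MMD for c_depswitch) are all assumptions made for the example.

```python
import hashlib
import json

# Keys kept when producing a C object, per the list above.
C_OBJECT_KEYS = ("cc", "c_flags", "c_depswitch", "__order_only__")


def minimize(env, keys):
    """Return a copy of env restricted to the given keys."""
    return {k: env[k] for k in keys if k in env}


def env_hash(env):
    """Hash an environment via a canonical JSON encoding (an assumed scheme)."""
    encoded = json.dumps(env, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


def object_path(env, source):
    """Build env/<hash>/<source>.o, keeping the source's original extension."""
    digest = env_hash(minimize(env, C_OBJECT_KEYS))
    return "env/{}/{}.o".format(digest, source)


# Hypothetical environment; link_flags is irrelevant to compilation.
env = {
    "cc": "/usr/bin/gcc",
    "c_flags": ["-O2", "-Iinclude"],
    "c_depswitch": "-MMD",
    "__order_only__": [],
    "link_flags": ["-static"],
}

path_a = object_path(env, "foo/bar/baz.c")
# Changing a key outside the minimized set leaves the path alone...
path_b = object_path(dict(env, link_flags=[]), "foo/bar/baz.c")
# ...while changing the compile flags forks a new version of the object.
path_c = object_path(dict(env, c_flags=["-O0"]), "foo/bar/baz.c")
```

Here path_a and path_b are identical, so the object is built once and shared, while path_c lands in a different env/ directory.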

Note also that Cobble leaves the original file extension and just adds .o. This means the environment doesn’t need to indicate whether the source file was written in C or C++ or D or Fortran or… but can still prevent collisions when a single directory contains source files with the same name written in different languages. (Yes, this does happen in real projects. Think: code generation.)

Note that we’re leaning on Ninja pretty hard here to simplify Cobble. Ninja is responsible for noticing:

  • Changes to which files are included by the root C file (the file itself must change).

  • Changes to those included files (Ninja records include activity and tracks changes there).

  • Changes to the path to the input file, which can happen when consuming files generated by a tool as part of the build (Ninja fingerprints command lines and will rebuild on any changes).

This leaves Cobble in charge of distinguishing different current versions of the object file, which is what Cobble is for.

Producing a Static Archive

By default, Cobble produces a static archive (a .a file) for each c_library in the project. (It actually may produce many static archives: one for each environment in which the library gets built.)

This behavior gets disabled in any environment where c_library_archive_products is defined to False. In such environments, libraries are left as bags of object files until linked into a program. But ignore this for now.

As with object files, we can guess at the shape of the minimized environment by looking at the command line and thinking about what it does. Here’s a typical Unix static archive command line:

ar rcs libfoo.a object1.o object2.o object3.o

The s in the flags is important: it asks the ar tool to generate an index. Without an index, the order of the object files inside the archive is significant. Cobble could derive the correct ordering of objects, but to do this it would need information about dependencies between source files inside a single target. This seems annoying to maintain, so we don’t do it.

So, then: what affects the output of ar? It can’t #include files or read process environment variables. Its output is determined solely by its inputs.

There’s some subtlety here. The ar command line above is simplified. In a real Cobble build, it’s more likely to resemble:

ar rcs env/12345/libfoo.a env/abcde/object1.o env/2134abc/object2.o ...

Remember that we may produce more than one static archive for a given library, because the objects it contains may differ from one environment to the next.

It’s sufficient to distinguish archives by the list of objects they contain, including the environment hashes. (Because we hash these hashes, we’re effectively creating a hash tree.)

But it’s also possible to have ar tools that behave differently. So the minimized archive environment contains:

  • ar: name of the archiver binary, preferably absolute.

  • ar_inputs: a synthetically generated key containing the sorted list of input objects. (Here, “synthetically generated” means it’s produced inside the target and cannot be affected externally.)

We sort ar_inputs to maximize work stealing: we want to guarantee that we reuse archives whenever their contents are identical. Because of the index, the order of inputs is not significant, so we can discard this information when computing the hash.
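The effect of sorting can be sketched in a few lines of Python. This is an illustration under assumptions, not Cobble's implementation: archive_env and env_hash are hypothetical helpers, and the JSON-plus-SHA-1 hash is just one plausible scheme.

```python
import hashlib
import json


def archive_env(ar, objects):
    """Build the minimized archive environment; inputs are sorted."""
    return {"ar": ar, "ar_inputs": sorted(objects)}


def env_hash(env):
    """Hash an environment via a canonical JSON encoding (an assumed scheme)."""
    encoded = json.dumps(env, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


objects = ["env/abcde/object1.o", "env/2134abc/object2.o"]

# The same objects listed in a different order yield the same environment.
forward = archive_env("/usr/bin/ar", objects)
backward = archive_env("/usr/bin/ar", list(reversed(objects)))

# An object built in a different environment changes the archive's hash,
# because the object's own hash is part of its path (the "hash tree").
other = archive_env("/usr/bin/ar", ["env/fffff/object1.o", "env/2134abc/object2.o"])
```

With this scheme, forward and backward hash identically, so only one archive is produced for both, while other gets its own.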

We’re not leaning on Ninja particularly hard for archiving.

Linking a Binary

Binaries are produced from both C files and libraries. The C files are turned into objects as described above; for the purposes of this discussion, you can pretend we then archive them — the results are similar.

To link a binary, we wind up invoking the C compiler again, in a different mode. (We could also invoke a linker directly, but this is not what Cobble does.)

The command line resembles the following:

gcc (link flags) -o my_binary libfoo.a libbar.a libbaz.a

Of course, both the binary path and the library paths are qualified by environment hashes, but you get the idea.

We’re in a very similar situation to object file production, except that in this mode, the compiler won’t #include more stuff. The output is determined by:

  • The choice of compiler,
  • The link flags,
  • What libraries are being linked in, and, unfortunately,
  • What order the libraries appear in.

Linking a C program is order-sensitive. The linker itself is gloriously simple: it operates in a single pass from left to right. At each object file or archive it:

  1. Checks if the input contains anything that matches a table of currently unresolved symbols. (Initially the unresolved symbols are given by the compiler/linker configuration — typically something like main).

  2. Links in those parts of the input. Depending on configuration (e.g. GNU ld’s --gc-sections) it may discard the rest.

  3. Records any undefined symbols present in the parts of the input that it kept around.

At the right end of the list, if there are undefined symbols, it fails.

This means that the linker’s inputs must be specified from leaf (the binary) to root using a topological sort, so that every input appears before the libraries that satisfy its undefined symbols. This requirement causes a lot of the complexity in Cobble’s implementation, and it’s something that most other build systems for C programs either get wrong (Gyp, GN) or foist onto the user (SCons, Make).

The solution relies on the environment.

Each c_library target’s using delta appends its archive path to two environment keys: __implicit__ and link_srcs. __implicit__ is a general Cobble key that describes Ninja implicit dependencies; this is how the binary comes to depend on the library build. link_srcs is C-specific. Both keys often contain the same list of libraries — so why are they separate?

Because sometimes link_srcs is different.

  • Some libraries may need to override the linker’s discarding of input sections by adding e.g. --whole-archive / --no-whole-archive around their archive in the linker command line. Since these switches aren’t build products, they can’t be added to __implicit__.

  • Some “libraries” actually represent system libraries, and instead of naming an archive, they’ll add something like -lpthread to link_srcs. Again, -lpthread is a switch, not a product, and cannot appear in __implicit__.

At the leaf target (the binary), the using deltas of the entire DAG are concatenated in depth-first topological order and applied to the leaf environment. This produces both __implicit__ and link_srcs keys ready for insertion into a command line.
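One standard way to get such an ordering is a depth-first traversal followed by reversing the postorder. The Python sketch below shows that technique on a small diamond-shaped graph. It is an assumption about how this could be done, not a copy of Cobble's algorithm, and the target names, the deps and using tables, and both helper functions are hypothetical.

```python
def link_order(root, deps):
    """Return targets reachable from root, dependents before dependencies."""
    visited = set()
    postorder = []

    def visit(target):
        if target in visited:
            return
        visited.add(target)
        for dep in deps.get(target, []):
            visit(dep)
        postorder.append(target)

    visit(root)
    return list(reversed(postorder))


def collect(root, deps, using):
    """Concatenate each target's using-delta contributions in link order."""
    env = {"__implicit__": [], "link_srcs": []}
    for target in link_order(root, deps):
        for key, values in using.get(target, {}).items():
            env.setdefault(key, []).extend(values)
    return env


# app depends on foo and bar; both depend on base; base needs pthread.
deps = {"app": ["foo", "bar"], "foo": ["base"], "bar": ["base"], "base": ["pthread"]}
using = {
    "foo": {"__implicit__": ["libfoo.a"], "link_srcs": ["libfoo.a"]},
    "bar": {"__implicit__": ["libbar.a"], "link_srcs": ["libbar.a"]},
    "base": {"__implicit__": ["libbase.a"], "link_srcs": ["libbase.a"]},
    # A system library contributes a switch, not a build product.
    "pthread": {"link_srcs": ["-lpthread"]},
}

env = collect("app", deps, using)
```

Here env["link_srcs"] comes out as libbar.a, libfoo.a, libbase.a, -lpthread: libbase.a appears once, after both of its users, and -lpthread appears in link_srcs but not in __implicit__.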

Of course, libraries may produce other keys in their using deltas. The other one consumed during the link stage is link_flags, which appears before the link_srcs and can contain general linker control switches like --gc-sections.

So the minimized binary link environment contains:

  • cc (at the moment), the preferably absolute path to the compiler;
  • link_flags, a list of strings; and
  • link_srcs, a list of strings naming archives or external -lfoo flags.
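Those three keys are enough to assemble the link command line. The Python sketch below shows the shape of it. The link_command helper and the example values are hypothetical, and the -Wl, prefix is simply how a linker switch is spelled when it is passed through the gcc driver rather than to ld directly.

```python
def link_command(env, output):
    """Assemble a link command: compiler, flags, output, then sources."""
    return (
        [env["cc"]]
        + list(env["link_flags"])
        + ["-o", output]
        + list(env["link_srcs"])
    )


# Hypothetical minimized link environment.
link_env = {
    "cc": "/usr/bin/gcc",
    "link_flags": ["-Wl,--gc-sections"],
    "link_srcs": ["env/12345/libfoo.a", "env/abcde/libbar.a", "-lpthread"],
}

command = link_command(link_env, "env/6789f/my_binary")
```

The order of link_srcs is passed through untouched, since that order is exactly what the linker is sensitive to.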

Note that, after all the discussion of __implicit__ above, it’s not in the list. This is because it exists purely to help Ninja derive a build ordering, and cannot contribute to “forking” a binary into different versions by environment. It is an output of the environment and DAG. link_srcs contains all the information we need.

Note also that, unlike ar_inputs above, link_srcs is explicitly not synthetically generated inside the target. It’s collected across the whole transitive DAG — it has to be. We should probably come up with a clear naming convention to distinguish these types of keys.

Finally, an aside about objects vs. archives. When linking objects directly — e.g. for source files that belong directly to a binary target, or when archiving is disabled via c_library_archive_products — the discussion above holds, but link_srcs and __implicit__ accumulate object file paths instead of archive paths.

What About C++? Assembly?

This same model works for several languages — at minimum:

  • C++
  • D
  • Fortran
  • Assembly

Basically, any language that relies on a traditional Unix-style linker, and which has a direct source-file-to-object-file correspondence. (This rules out traditional Java compilers, for example.)

Currently, Cobble’s C support plugin (cobble.target.c) has a hardcoded mapping of source file extension to environment minimization maps. These maps are defined for C, C++, and Assembly with C Preprocessing (aspp). Only the object file production behavior changes; archiving and linking are language-independent.

The relevant keys for each language are:

  • C: cc, c_flags, c_depswitch, __order_only__
  • C++: cxx, cxx_flags, c_depswitch, __order_only__
  • Assembly: aspp, aspp_flags, c_depswitch, __order_only__
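A mapping like this might look roughly as follows in Python. The key tuples come from the list above, but the extension table is an assumption for illustration (the real plugin's extensions may differ), as are the names MINIMIZATION_KEYS, LANGUAGE_BY_EXTENSION, and keys_for_source.

```python
import os

# Minimization keys per language, taken from the list above.
MINIMIZATION_KEYS = {
    "c": ("cc", "c_flags", "c_depswitch", "__order_only__"),
    "cxx": ("cxx", "cxx_flags", "c_depswitch", "__order_only__"),
    "aspp": ("aspp", "aspp_flags", "c_depswitch", "__order_only__"),
}

# Assumed extension table; not taken from the real plugin.
LANGUAGE_BY_EXTENSION = {
    ".c": "c",
    ".cc": "cxx",
    ".cpp": "cxx",
    ".S": "aspp",
}


def keys_for_source(path):
    """Pick the environment keys that matter for compiling this source file."""
    extension = os.path.splitext(path)[1]
    try:
        return MINIMIZATION_KEYS[LANGUAGE_BY_EXTENSION[extension]]
    except KeyError:
        raise ValueError("no production model for {!r}".format(path))
```

For example, keys_for_source("foo/bar/baz.c") selects the C keys, while a .java file raises ValueError, since it has no entry in the table.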
