Progress report - prime path coverage in gcc 2

Progress report - prime path coverage in gcc 2

June 25, 2024

It’s time for another progress report. The last post covered the phases at a high level – please read that first for an overview.

Observing paths

To observe paths we associate a bitset with each function, where each bit corresponds to a path. Which bit maps to which path is simply its index in the lexicographically sorted set of paths. When an edge is taken, the a mask is applied to the bitset so the paths not taken are masked out: accu[block] = accu[pred] & paths[block]. Finally, paths that start in the block are masked in accu[block] = accu[block] | starts[block]. Because the bitset may be larger than a single host native integer, larger bitsets are broken down into “buckets”, and every block is instrumented with code that is roughly equivalent to this:

for (b : blocks):
  for (x : buckets):
	maybe-flush (accu[b][x])
	accu[b][x] = (accu[b][x] & accu[pred][x]) | starts[b][x]

Doing less work

The first revision of this patch was terribly slow and created massive binaries. Some of this is inevitable - as the number of paths grow large so does the number of instructions we need just to keep track of the paths we are currently on. I focused on reducing unnecessary work which had a big impact The previous version built tree.c in just under 10 minutes and created a 50M tree.o – the current version builds the same file in approx. 2m30s and results in a 32M tree.o. This improvement came from applying these techniques:

Subset aware instructions. Most basic blocks are in just a (much smaller) subset of the paths, which means we don’t need to apply updates to unaffected buckets. This means generating fewer pointless assignments and phis.

Removing ineffective masks. If all paths in the bucket goes through a basic block there is no need to emit AND instructions as it would not mask out any bits. x & ~0 == x. An interesting extension of this is not emitting individual ANDs for the phi when multiple predecessors all apply the same mask. In this case no mask needs to be applied at all as it would not unset or undo any path. There must be an effective discriminating AND later and the masking can be put off until then.

Straight line optimizations for the subpaths without any branching. In this case we only need to apply the mask once for the full chain, as repeated applications are ineffective. In code this would be (x & mask & mask) == (x & mask)

Replacing bitwise operations with constants. This is typically the case when paths start in a basic block, but no other paths in the same bucket go through that basic block. In this case we can replace 0 | mask with just mask.

Experimenting with the optimizer

Running this version through the optimizer with gcc -Os adds about 10 minutes to the build time. Interestingly, the effect on build time is less for the previous version - it only adds about 5 minutes. For both version the final object size is reduced objects to roughly the same size (approx 16M, down from 32M and 50M respectively). This means the current version is at about 2x optimal (in space) which leaves room for improvement but is probably still ok.