Progress report - prime path coverage in gcc 2
It’s time for another progress report. The last post covered the phases at a high level – please read that first for an overview.
Observing paths
To observe paths we associate a bitset with each function, where each
bit corresponds to a path. Which bit maps to which path is simply its
index in the lexicographically sorted set of paths. When an edge is
taken, the a mask is applied to the bitset so the paths not taken
are masked out: accu[block] = accu[pred] & paths[block]
. Finally,
paths that start in the block are masked in accu[block] = accu[block] | starts[block]
. Because the bitset may be larger than a
single host native integer, larger bitsets are broken down into
“buckets”, and every block is instrumented with code that is roughly
equivalent to this:
for (b : blocks):
for (x : buckets):
maybe-flush (accu[b][x])
accu[b][x] = (accu[b][x] & accu[pred][x]) | starts[b][x]
Doing less work
The first revision of this patch was terribly slow and created massive binaries. Some of this is inevitable - as the number of paths grow large so does the number of instructions we need just to keep track of the paths we are currently on. I focused on reducing unnecessary work which had a big impact The previous version built tree.c in just under 10 minutes and created a 50M tree.o – the current version builds the same file in approx. 2m30s and results in a 32M tree.o. This improvement came from applying these techniques:
Subset aware instructions. Most basic blocks are in just a (much smaller) subset of the paths, which means we don’t need to apply updates to unaffected buckets. This means generating fewer pointless assignments and phis.
Removing ineffective masks. If all paths in the bucket goes
through a basic block there is no need to emit AND instructions as it
would not mask out any bits. x & ~0 == x
. An interesting extension
of this is not emitting individual ANDs for the phi when multiple
predecessors all apply the same mask. In this case no mask needs to be
applied at all as it would not unset or undo any path. There must be
an effective discriminating AND later and the masking can be put off
until then.
Straight line optimizations for the subpaths without any
branching. In this case we only need to apply the mask once for the
full chain, as repeated applications are ineffective. In code this
would be (x & mask & mask) == (x & mask)
Replacing bitwise operations with constants. This is typically the
case when paths start in a basic block, but no other paths in the same
bucket go through that basic block. In this case we can replace 0 | mask
with just mask
.
Experimenting with the optimizer
Running this version through the optimizer with gcc -Os
adds about
10 minutes to the build time. Interestingly, the effect on build time
is less for the previous version - it only adds about 5 minutes.
For both version the final object size is reduced objects to roughly
the same size (approx 16M, down from 32M and 50M respectively). This
means the current version is at about 2x optimal (in space) which
leaves room for improvement but is probably still ok.