
# Parallelism if more than one target depends on the same subdir

Recursive make is especially painful when it comes to
parallelism.  Take a look at this Makefile fragment:

	all: fred bob
	subproj:
		touch $@.new
		sleep 1
		mv $@.new $@
	fred:
		$(MAKE) subproj
		touch $@
	bob:
		$(MAKE) subproj
		touch $@

If we run it serially, it all looks good:

	$ rm -f subproj fred bob; make --no-print-directory
	make subproj
	touch subproj.new
	sleep 1
	mv subproj.new subproj
	touch fred
	make subproj
	make[1]: 'subproj' is up to date.
	touch bob

But if we run it in parallel, life sucks:

	$ rm -f subproj fred bob; make -j2 --no-print-directory
	make subproj
	make subproj
	touch subproj.new
	touch subproj.new
	sleep 1
	sleep 1
	mv subproj.new subproj
	mv subproj.new subproj
	mv: cannot stat 'subproj.new': No such file or directory
	touch fred
	make[1]: *** [subproj] Error 1
	make: *** [bob] Error 2

What happened?  The sub-make that builds `subproj` ended up
running twice at once, because both fred and bob need to
build it.

If fred and bob had declared a *dependency* on subproj, then
GNU make would be smart enough to build it only once, before
either of them; it can do ordering inside a single make process.
So this example is a bit contrived.  But imagine that fred
and bob are two separate applications being built from the
same toplevel Makefile, and they both depend on the library
in subproj.  You'd run into this problem if you use
recursive make.

Of course, you might try to solve this by using
*nonrecursive* make, but that's really hard.  What if
subproj is a library from some other vendor?  Will you
modify all their makefiles to fit into your nonrecursive
makefile scheme?  Probably not.

Another common workaround is to have the toplevel Makefile
build subproj, then fred and bob.  This works, but if you
don't run the toplevel Makefile and want to go straight
to work in the fred project, building fred won't actually
build subproj first, and you'll get errors.

redo solves all these problems.  It maintains global locks
across all its instances, so you're guaranteed that no two
instances will try to build subproj at the same time.  And
this works even if subproj is a make-based project; you
just need a simple subproj.do that runs `make subproj`.
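
To make the locking idea concrete, here is a small shell simulation of
fred and bob both asking for subproj at the same time.  It is only a
sketch: a `mkdir`-based lock directory stands in for redo's real locks,
and the names `build_subproj`, `subproj.lock`, and `log` are made up for
this example.

```sh
#!/bin/sh
# Simulation only: this is not how redo implements its locks.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

build_subproj() {
    # mkdir is atomic, so only one caller at a time gets past this loop.
    until mkdir "$dir/subproj.lock" 2>/dev/null; do
        sleep 1
    done
    # Whoever gets the lock second finds subproj already built and skips it.
    if [ ! -e "$dir/subproj" ]; then
        echo "subproj built on behalf of $1" >>"$dir/log"
        touch "$dir/subproj.new"
        mv "$dir/subproj.new" "$dir/subproj"
    fi
    rmdir "$dir/subproj.lock"
    touch "$dir/$1"
}

# fred and bob both need subproj at the same time, like make -j2.
build_subproj fred &
build_subproj bob &
wait

cat "$dir/log"    # one line, for whichever of fred or bob won the race
```

Unlike the `make -j2` run above, the second builder waits for the lock,
sees that subproj already exists, and moves on without rebuilding it.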


# Dependency problems that only show up during parallel builds

One annoying thing about parallel builds is... they do more
things in parallel.  A very common problem in make is to
have a Makefile rule that looks like this:

	all: a b c

When you `make all`, it first builds a, then b, then c.
What if c depends on b?  Well, it doesn't matter when
you're building in serial.  But with -j3, you end up
building a, b, and c at the same time, and the build for c
crashes.  You *should* have said:

	all: a b c
	c: b
	b: a

and that would have fixed it.  But you forgot, and you
don't find out until you build with exactly the wrong -j
option.

This mistake is easy to make in redo too.  But redo does have
a tool that helps you debug it: the --shuffle option.
--shuffle takes the dependencies of each target and builds
them in a random order, so you can get parallel-like
results without actually building in parallel.
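
The following shell simulation shows why a different build order exposes
the bug.  It does not use redo at all: `build` and `build_in_order` are
hypothetical stand-ins, and the order `c a b` is just one order that a
shuffle could happen to pick.

```sh
#!/bin/sh
# Simulation only: three fake targets, where c secretly needs b.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

build() {
    case $1 in
        a) : >"$workdir/a" ;;
        b) : >"$workdir/b" ;;
        c) # The undeclared dependency: c fails unless b already exists.
           [ -e "$workdir/b" ] || { echo "c: b is missing" >&2; return 1; }
           : >"$workdir/c" ;;
    esac
}

# Start from scratch, then build the given targets serially in that order.
build_in_order() {
    rm -f "$workdir/a" "$workdir/b" "$workdir/c"
    for target in "$@"; do
        build "$target" || return 1
    done
}

build_in_order a b c && echo "order a b c: ok"
build_in_order c a b || echo "order c a b: failed"
```

The listed order always works, which is why the missing dependency goes
unnoticed in serial builds.  Any order that puts c before b fails, even
though nothing runs in parallel.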


# What about distributed builds?

FIXME:
So far, nobody has tried redo in a distributed build environment.  It surely
works with distcc, since that's just a distributed compiler.  But there are
other systems that distribute more of the build process to other machines.

The most interesting method I've heard of was explained (in public, this is
not proprietary information) by someone from Google.  Apparently, the
Android team uses a tool that mounts your entire local filesystem on a
remote machine using FUSE and chroots into that directory.  Then you replace
the $SHELL variable in your copy of make with one that runs this tool.
Because the remote filesystem is identical to yours, the build will
certainly complete successfully.  After the $SHELL program exits, the changed
files are sent back to your local machine.  Cleverly, the files on the
remote server are cached based on their checksums, so files only need to be
re-sent if they have changed since last time.  This dramatically reduces
bandwidth usage compared to, say, distcc (which mostly just re-sends the
same preparsed headers over and over again).

At the time, he promised to open source this tool eventually.  It would be
pretty fun to play with it.

The problem:

This idea won't work as easily with redo as it did with
make.  With make, a separate copy of $SHELL is launched for
each step of the build (and gets migrated to the remote
machine), but make runs only on your local machine, so it
can control parallelism and avoid building the same target
from multiple machines, and so on.  The key to the above
distribution mechanism is that it can send files to the remote
machine at the beginning of the $SHELL, and send them back
when the $SHELL exits, and know that nobody cares about
them in the meantime.  With redo, since the entire script
runs inside a shell (and the shell might not exit until the
very end of the build), we'd have to do the parallelism
some other way.

I'm sure it's doable, however.  One nice thing about redo
is that the source code is so small compared to make: you
can just rewrite it.


# Can I convince a sub-redo or sub-make to *not* use parallel builds?

Yes.  Put this in your .do script:

	unset MAKEFLAGS

The child makes will then not have access to the jobserver,
so they will build serially instead.
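
If only one command in the .do script needs to run serially, one option
is to do the `unset` inside a subshell, so that later commands still see
the jobserver.  A minimal sketch: the `MAKEFLAGS` value is set by hand
here purely for demonstration, and the hypothetical `show` function
stands in for a real sub-make or sub-redo.

```sh
#!/bin/sh
# Pretend we inherited a jobserver from a parent redo or make.
MAKEFLAGS="-j --jobserver-auth=3,4"
export MAKEFLAGS

show() {
    # Stand-in for a child build: report the MAKEFLAGS it would inherit.
    sh -c 'echo "$0 sees: ${MAKEFLAGS:-nothing}"' "$1"
}

# The unset only lasts until the closing parenthesis.
( unset MAKEFLAGS; show "serial child" )

# Commands after the subshell still inherit the original value.
show "later child"
```

The first child reports `nothing` and the second reports the original
flags, so only the command inside the parentheses loses access to the
jobserver.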


<a name='MAKEFLAGS'></a>
# What does the "broken --jobserver-auth" error mean?

redo (and GNU make) use the `MAKEFLAGS` environment variable to pass
information about the parallel build environment from one process to the
next.  Inside `MAKEFLAGS` is a string that looks like either
`--jobserver-auth=X,Y` or `--jobserver-fds=X,Y`, depending on the version of
make.

If redo finds one of these strings, but the file descriptors named by `X`
and `Y` are *not* available in the subprocess, that means some ill-behaved
parent process has closed them.  This prevents parallelism from working, so
redo aborts to let you know something is seriously wrong.
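
If you want to see what a subprocess actually inherited, a check along
these lines can help.  This is a rough shell sketch, not redo's own
code: the function names are hypothetical, and it only understands the
`X,Y` file descriptor form described above.

```sh
#!/bin/sh
# Hypothetical helpers for illustration; redo's real check is not this script.

# Print "X Y" if MAKEFLAGS advertises a jobserver as a pair of fds.
jobserver_fds() {
    for flag in $MAKEFLAGS; do
        case $flag in
            --jobserver-auth=*,*|--jobserver-fds=*,*)
                pair=${flag#*=}
                echo "${pair%%,*} ${pair#*,}"
                return 0
                ;;
        esac
    done
    return 1
}

# Succeed if $1 is a plain number naming a currently open fd.
# Caveat: some minimal shells only accept single-digit fds in this
# redirection and will report larger ones as closed.
fd_is_open() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Duplicating a closed fd fails; the subshell contains the error.
    ( : >&"$1" ) 2>/dev/null
}

check_jobserver() {
    fds=$(jobserver_fds) || { echo "no jobserver advertised"; return 0; }
    for fd in $fds; do
        fd_is_open "$fd" || { echo "jobserver fd $fd is closed"; return 1; }
    done
    echo "jobserver fds $fds are open"
}

check_jobserver || echo "(this is the situation where redo aborts)"
```

Running this from inside a Makefile rule or a .do script shows whether
the parent passed the descriptors through or closed them.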

GNU make will intentionally close these file descriptors if you write a
Makefile rule that contains *neither* the exact string `$(MAKE)` nor a
leading `+` character.  So you might have had a Makefile rule that looked
like this:

	subdir/all:
		$(MAKE) -C subdir all

and that worked as expected: the sub-make inherited your parallelism
settings.  But people are sometimes surprised to find that this doesn't work
as expected:

	subdir/all:
		make -C subdir all

In that case, the sub-make does *not* inherit the jobserver file
descriptors, so it runs serially.  If for some reason you don't want to use
`$(MAKE)` but you do want parallelism, you need to write something like this
instead:

	subdir/all:
		+make -C subdir all

And similarly, if you recurse into redo instead of make, you need the same
trick:

	subdir/all:
		+redo subdir/all

There are a few other programs that also close file descriptors.  For
example, if your .do file starts with `#!/usr/bin/env xonsh`, you might
run into [a bug in xonsh where it closes file descriptors
incorrectly](https://github.com/xonsh/xonsh/issues/2984).

If you really can't stop your program from closing file descriptors that it
shouldn't, you can work around the problem by unsetting `MAKEFLAGS`.  This
will let your program build, but will disable parallelism.