redo now saves the stderr from every .do script, for every target, into
a file in the .redo directory. That means you can look up the logs
from the most recent build of any target using the new redo-log
command, for example:
    redo-log -r all
The default is to show logs non-recursively, that is, it'll show when a
target does redo-ifchange on another target, but it won't recurse into
the logs for the latter target. With -r (recursive), it does. With -u
(unchanged), it recurses even if redo-ifchange discovered that the target
was already up-to-date; in that case, it prints the logs from the *most
recent* time the target was generated.
With --no-details, redo-log will show only the 'redo' lines, not the
other log messages. For very noisy builds (such as one that recurses into
a 'make' instance), this gives an overview of what happened, without
all the cruft.
You can use the -f (follow) option like tail -f, to follow a build
that's currently in progress until it finishes. redo itself spins up a
copy of redo-log -r -f while it runs, so you can see what's going on.
Still broken in this version:
- No man page or new tests yet.
- ANSI colors don't yet work (unless you use --raw-logs, which gives
  the old-style behaviour).
- You can't redirect the output of a sub-redo to a file or a
  pipe right now, because redo-log is eating it.
- The regex for matching 'redo' lines in the log is very gross.
  Instead, we should put the raw log files in a more machine-parseable
  format, and redo-log should turn that into human-readable format.
- redo-log tries to "linearize" the logs, which makes them
  comprehensible even for a large parallel build. It recursively shows
  log messages for each target in depth-first tree order (by tracing
  into a new target every time it sees a 'redo' line). This works
  really well, but in some specific cases, the "topmost" redo instance
  can get stuck waiting for a jwack token, which makes it look like the
  whole build has stalled, when really redo-log is just waiting a long
  time for a particular subprocess to be able to continue. We'll need to
  add a specific workaround for that.
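The depth-first "linearize" idea from the last item can be sketched roughly as below. This is only an illustration, not redo-log's actual code: the in-memory `logs` dict, the `linearize` function, and the `redo  <target>` line format matched by `REDO_LINE` are all assumptions made for the example (the real log format is exactly the part described above as needing cleanup).

```python
import re

# Assumed (hypothetical) format for a 'redo' line: "redo  <target>".
REDO_LINE = re.compile(r'^redo\s+(\S+)$')


def linearize(logs, target, recursive=True, _seen=None):
    """Yield the log lines of `target`, tracing into sub-targets depth-first.

    `logs` maps a target name to the list of lines its .do script logged.
    """
    if _seen is None:
        _seen = set()
    _seen.add(target)
    for line in logs.get(target, []):
        yield line
        m = REDO_LINE.match(line)
        if m and recursive:
            sub = m.group(1)
            # Safety guard for this sketch only: never descend into the
            # same target twice, so a cyclic log cannot recurse forever.
            if sub not in _seen:
                for subline in linearize(logs, sub, recursive, _seen):
                    yield '  ' + subline


example_logs = {
    'all': ['redo  a.o', 'redo  b.o', 'ld -o all a.o b.o'],
    'a.o': ['cc -c a.c'],
    'b.o': ['cc -c b.c'],
}
for out_line in linearize(example_logs, 'all'):
    print(out_line)
```

With `recursive=False` only the lines of the top target are produced, which roughly corresponds to the default non-recursive view; passing `recursive=True` mimics -r.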
The way the code was written, we'd give up our token, detect a cyclic
dependency, and then try to get our token back before exiting. Even
with -j1, the temporary token release allowed any parent up the tree to
continue running jobs, so it would take an arbitrary amount of time
before we could exit (and report an error code to the parent).
There was no visible symptom of this except that, with -j1, t/355-deps-cyclic
would not finish until some of the later tests finished, which was
surprising.
To fix it, let's just check for a cyclic dependency first, then release
the token only once we're sure things are sane.
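The reordering can be pictured with a toy model like the one below. It is a sketch under heavy assumptions: `TokenPool`, `wait_for_dep`, and `CyclicDependencyError` are hypothetical names, the pool is just a counter rather than a real jobserver, and the cycle test is a plain membership check on a list of ancestors.

```python
class CyclicDependencyError(Exception):
    pass


class TokenPool:
    """Toy stand-in for the jobserver: only counts the tokens we hold."""

    def __init__(self, held=1):
        self.held = held

    def release(self):
        self.held -= 1

    def acquire(self):
        # In a real jobserver this could block for an arbitrary time.
        self.held += 1


def wait_for_dep(target, ancestors, pool, wait):
    # Check for the cycle first: the error path never touches the token,
    # so we can exit (and report the error) immediately.
    if target in ancestors:
        raise CyclicDependencyError(target)
    # Only now, when things look sane, give up the token while we wait.
    pool.release()
    try:
        wait(target)
    finally:
        pool.acquire()
```

The point of the ordering is that the failing case returns without ever having to win a token back from a busy parent.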
I think we were sometimes leaving half-done sqlite transactions sitting
around for a long time (e.g. across sub-calls to .do files). This
seemed to be okay on Linux, but caused sqlite deadlocks on macOS. Most
likely it's not the operating system, but the sqlite version and
journal mode in use.
In any case, the correct thing to do is to actually commit or rollback
transactions, not leave them hanging around.
...unfortunately this doesn't actually fix my macOS deadlocks, which
makes me rather nervous.
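For illustration, here is one way the commit-or-rollback discipline can look with Python's standard sqlite3 module. This is a minimal sketch, not redo's state code: the `targets` table, the `update_stamp` function, and the `run_do_script` callback are hypothetical. It relies on the documented behaviour that using a connection as a context manager commits on success and rolls back if the block raises.

```python
import sqlite3


def update_stamp(conn, name, stamp, run_do_script=None):
    """Write one row, then call the slow part with no transaction open."""
    # "with conn" commits if the block succeeds and rolls back if it
    # raises, so no half-done transaction is left lying around.
    with conn:
        conn.execute(
            "insert or replace into targets (name, stamp) values (?, ?)",
            (name, stamp))
    # Anything slow (such as running a .do file) happens only after the
    # transaction has been closed.
    if run_do_script is not None:
        run_do_script(name)
```

`conn.in_transaction` can be checked before a long sub-call to confirm that nothing is still pending.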
GNU make post-4.2 renamed the --jobserver-fds option to
--jobserver-auth. For compatibility with both older and newer
versions, when we set MAKEFLAGS we set both, and when we read MAKEFLAGS
we will accept either one.
Also, when MAKEFLAGS was not already set, redo would set a MAKEFLAGS with a
leading 'None' string, which was incorrect. It should be the empty
string instead.
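A rough sketch of the "write both, accept either" approach follows. The function names `parse_jobserver_fds` and `add_jobserver_flags` are made up for this example, and it only handles the numeric `R,W` file descriptor form of the option.

```python
import re


def parse_jobserver_fds(makeflags):
    """Return (read_fd, write_fd) from a MAKEFLAGS string, or None.

    Accepts both the older --jobserver-fds and the newer --jobserver-auth
    spelling. Only the numeric "R,W" form is handled in this sketch.
    """
    m = re.search(r'--jobserver-(?:fds|auth)=(\d+),(\d+)', makeflags or '')
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))


def add_jobserver_flags(makeflags, rfd, wfd):
    """Return a MAKEFLAGS value that older and newer GNU make both accept."""
    # Treat an unset MAKEFLAGS (None) as empty, so the result never
    # starts with the literal string 'None'.
    parts = [makeflags] if makeflags else []
    parts.append('--jobserver-fds=%d,%d' % (rfd, wfd))
    parts.append('--jobserver-auth=%d,%d' % (rfd, wfd))
    return ' '.join(parts)
```

For example, `add_jobserver_flags(None, 3, 4)` yields a string containing both options, and `parse_jobserver_fds` reads the same pair back from either one.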
That makes it a little easier to tell, in a strace, what the process is
waiting on. If it's 100/101, then it's waiting on a token; 50+ means waiting
on a subtask.
Also, we weren't closing the read side of subtask fds on exec. This didn't
cause any problems, but did result in a wasted fd in subprocesses.
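Marking an fd close-on-exec is a small fcntl() call. The helper below is a sketch of that idea using the standard fcntl module; the name `close_on_exec` and its `yes` parameter are choices made for this example.

```python
import fcntl


def close_on_exec(fd, yes=True):
    """Set or clear FD_CLOEXEC so exec'd children do (not) inherit fd."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    flags &= ~fcntl.FD_CLOEXEC
    if yes:
        flags |= fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)
```

Calling `close_on_exec(r)` on the read side of a subtask pipe right after creating it keeps that fd out of any subprocess started later. Note that Python 3.4 and later already create new fds as non-inheritable by default, so this matters mostly for fds that were made inheritable on purpose or inherited from elsewhere.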
Although we were deadlock-free before, under some circumstances we'd end up
holding a perfectly good token while in sync wait; that would reduce our
parallelism for no good reason. So give back our tokens before waiting for
anybody else.
It creates a race condition: GNU Make might try to read while the socket is
O_NONBLOCK, get EAGAIN, and die; or else another redo might set it back to
blocking in between our call to make it O_NONBLOCK and our call to read().
This method - setting an alarm() during the read - is hacky, but should work
every time. Unfortunately you get a 1s delay - rarely - when this happens.
The good news is it only happens when there are no tokens available anyhow,
so it won't affect performance much in any situation I can imagine.
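As an illustration of the alarm-during-read trick, here is a minimal Python 3 sketch. The names `read_with_alarm`, `_Timeout`, and `_on_alarm` are invented for the example, and it is Unix-only and must run in the main thread. One assumption worth calling out: since Python 3.5 an interrupted `os.read()` is retried automatically unless the signal handler raises, so the handler here raises an exception to break out of the read.

```python
import os
import signal


class _Timeout(Exception):
    pass


def _on_alarm(signum, frame):
    # Raising here is what actually interrupts the blocking os.read().
    raise _Timeout()


def read_with_alarm(fd, nbytes=1, seconds=1):
    """Blocking read that gives up after `seconds`; returns None on timeout.

    The fd stays in blocking mode the whole time, so other processes
    sharing it never see O_NONBLOCK.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    try:
        signal.alarm(seconds)
        try:
            return os.read(fd, nbytes)
        except _Timeout:
            return None
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

This sketch still has a narrow window: if the alarm fires just after `os.read()` has returned but before the value is handed back, the byte that was read would be dropped. A real implementation would need to make sure a token is never lost that way.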
instead of inside the fork.
Still doesn't seem to affect runtime. Good.
One nice side effect is jwack.py no longer needs to know anything about our
locks.
That way, if everything is locked, we can determine that with a single
token, reducing context switches.
But mostly this is good because the code is simpler.
Now t/curse passes again when parallelized (except for the countall
mismatch, since we haven't fixed the source of that problem yet). At least
it's consistent now.
There's a bunch of stuff rearranged in here, but the actual important
problem was that we were doing unlink() on the lock fifo even if ENXIO,
which meant a reader could connect in between ENXIO and unlink(), and thus
never get notified of the disconnection. This would cause the build to
randomly freeze.
atoi() was getting redundant, and unfortunately we can't easily load
helpers.py in some places where we'd want to, because it depends on vars.py.
So move it to its own module.
We'll have to stop using nonblocking reads, unfortunately. But this seems
to work better than nothing. There's still a race condition that could
theoretically make GNU make angry, since we briefly set the socket to
nonblocking.
But it seems to be pretty unsolvable in the current form; the problem is
that when you're nesting one jwack inside the other and the jobserver is GNU
make, there's no way to tell the parent jwack not to use up a token. Thus,
if you nest too deeply, it just deadlocks.
So this approach isn't really going to work the way it is.