We can't just delete all the dependencies at the beginning and re-add them:
other people might be checking the same dependencies in parallel. Instead,
mark them as delete_me up front, and then after the build completes, remove
only the delete_me entries.
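A minimal sketch of the mark-and-sweep idea, assuming a simple sqlite deps
table (the table layout and the rebuild() helper here are hypothetical;
only the delete_me flag comes from the actual change):

```python
import sqlite3

# Hypothetical deps table; only the delete_me flag is from the real change.
db = sqlite3.connect(":memory:")
db.execute("create table deps (target text, source text, delete_me int)")
db.executemany("insert into deps values (?, ?, 0)",
               [("prog", "a.o"), ("prog", "b.o")])

def rebuild(target, new_sources):
    # Mark existing deps instead of deleting them, so parallel checkers
    # still see a complete dependency list while we build.
    db.execute("update deps set delete_me=1 where target=?", (target,))
    for src in new_sources:
        # Re-adding a dep replaces the marked entry with a live one.
        db.execute("delete from deps where target=? and source=?",
                   (target, src))
        db.execute("insert into deps values (?, ?, 0)", (target, src))
    # Only after the build completes, drop the stale entries.
    db.execute("delete from deps where target=? and delete_me=1", (target,))

rebuild("prog", ["a.o", "c.o"])
print(sorted(s for (s,) in db.execute(
    "select source from deps where target='prog'")))  # ['a.o', 'c.o']
```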
When called as redo-ifchange, this might be intentional, since you might
just want to set a dependency on it; so we won't say anything from inside
builder.py.
But if you're calling redo.py, that means you expect it to be rebuilt, since
there's no other reason to try. So print a warning.
(This is what make does, more or less.)
...only when running under minimal/do, of course.
The tests in question mostly fail because they're testing particular
dependency-related behaviour, and minimal/do doesn't support dependencies,
so naturally it doesn't work.
Just allow that sub-redo to return an error code. Also, parent redos should
return error code 1, not the same code as the child. That makes it easier
to figure out which file generated the "special" error code.
That makes it a little easier to tell, in a strace, what the process is
waiting on. If it's 100/101, then it's waiting on a token; 50+ means waiting
on a subtask.
Also, we weren't closing the read side of subtask fds on exec. This didn't
cause any problems, but did result in a wasted fd in subprocesses.
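The fd fix is just FD_CLOEXEC on the read side; a sketch (the real
builder.py code differs, and modern Python 3 pipes are already
non-inheritable per PEP 446):

```python
import fcntl, os

r, w = os.pipe()
# Mark the read side close-on-exec so exec'd subprocesses don't inherit
# (and waste) the fd.
flags = fcntl.fcntl(r, fcntl.F_GETFD)
fcntl.fcntl(r, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
assert fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC
```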
If a checksummed target A used to exist but is now missing, and we tried to
redo-ifchange that exact file, we would unnecessarily run 'redo-oob A A';
that is, we have to build A in order to determine if A needs to be built.
The sub-targets of redo-oob aren't run with REDO_UNLOCKED, so this would
deadlock instantly.
Add an assertion to redo-oob to ensure we never try to redo-ifchange the
primary target (thus converting the deadlock into an exception). And skip
doing redo-oob when the target is already the same as the thing we have to
check.
We called 'redo' instead of 'redo-ifchange' on our indeterminate objects.
Since other instances of redo-oob might be running at the same time, this
could cause the same object to get rebuilt more than once unnecessarily.
The unit tests caught this; I just didn't notice earlier.
We were giving up and rebuilding the toplevel object, which did eventually
rebuild our checksummed file, but then the file turned out to be identical
to what it was before, so that nobody *else* who depended on it ended up
getting rebuilt. So the results were indeterminate.
Now we treat it as if its dirtiness is unknown, so we build it using
redo-oob before building any of its dependencies.
If a depends on b depends on c, and c is dirty but b uses redo-stamp
checksums, then 'redo-ifchange a' is indeterminate: we won't know if we need
to run a.do unless we first build b, but the script that *normally* runs
'redo-ifchange b' is a.do, and we don't want to run that yet, because we
don't know for sure if b is dirty, and we shouldn't build a unless one of
its dependencies is dirty. Eek!
Luckily, there's a safe solution. If we *know* a is dirty - eg. because
a.do or one of its children has definitely changed - then we can just run
a.do immediately and there's no problem, even if b is indeterminate, because
we were going to run a.do anyhow.
If a's dependencies are *not* definitely dirty, and all we have is
indeterminate ones like b, then that means a's build process *hasn't
changed*, which means its tree of dependencies still includes b, which means
we can deduce that if we *did* run a.do, it would end up running b.do.
Since we know that anyhow, we can safely just run b.do, which will call
either b.set_checked() or b.set_changed(). Once that's done, we can
re-parse a's
dependencies and this time conclusively tell if it needs to be redone or
not. Even if it does, b is already up-to-date, so the 'redo-ifchange b'
line in a.do will be fast.
...now take all the above and do it recursively to handle nested
dependencies, etc, and you're done.
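The whole recursive scheme can be sketched roughly like this (the states,
function names, and build callback are all illustrative; redo's real code
tracks this through its database, not in memory):

```python
CLEAN, DIRTY, INDETERMINATE = "clean", "dirty", "indeterminate"

def check(t, deps, leaf_state, stamped, build):
    # Recursively classify each dependency first.
    results = {d: check(d, deps, leaf_state, stamped, build)
               for d in deps.get(t, ())}
    if INDETERMINATE in results.values() and DIRTY not in results.values():
        # No dep is *definitely* dirty, so t's build process is unchanged
        # and its dep tree still includes the indeterminate ones: it's
        # safe to build those now (out of band), settling each of them.
        for d, r in list(results.items()):
            if r == INDETERMINATE:
                results[d] = build(d)      # returns CLEAN or DIRTY
    if DIRTY in results.values():
        # t must be rebuilt; but if t itself is checksummed, the rebuild
        # might leave it byte-identical, so callers can't tell yet.
        return INDETERMINATE if t in stamped else DIRTY
    return leaf_state.get(t, CLEAN)

# a depends on b depends on c; c is dirty, b uses redo-stamp checksums.
deps = {"a": ["b"], "b": ["c"]}
built = []
result = check("a", deps, {"c": DIRTY}, stamped={"b"},
               build=lambda d: built.append(d) or CLEAN)
print(result, built)  # clean ['b']: b was rebuilt out of band, a wasn't
```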
We were rebuilding the checksummed file every time because redo-ifchange was
incorrectly assuming that a child's changed_runid being greater than my
changed_runid means I'm dirty. But if my checked_runid is >= the child's
changed_runid, then I'm clean, because my checksum didn't change.
Clear as mud?
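In other words (field names from the text; a sketch of the comparison, not
redo's actual code):

```python
def is_dirty(my_checked_runid, child_changed_runid):
    # The bug: comparing the child's changed_runid against my *changed*
    # runid.  A checksummed target's own changed_runid stays old even
    # after it's re-verified, so that test claimed "dirty" every run.
    # The fix: I'm clean if I was checked at or after the run in which
    # the child last actually changed.
    return child_changed_runid > my_checked_runid

print(is_dirty(my_checked_runid=7, child_changed_runid=5))  # False: clean
print(is_dirty(my_checked_runid=4, child_changed_runid=5))  # True: dirty
```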
A new redo-stamp program takes whatever you give it as stdin and uses it to
calculate a checksum for the current target. If that checksum is the same
as last time, then we consider the target to be unchanged, and we set
checked_runid and stamp, but leave changed_runid alone. That will make
future callers of redo-ifchange see this target as unmodified.
However, this is only "half" support because by the time we run the .do
script that calls redo-stamp, it's too late; the caller is a dependant of
the stamped program, which is already being rebuilt, even if redo-stamp
turns out to say that this target is unchanged.
The other half is coming up.
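Conceptually, redo-stamp behaves something like this sketch (the state dict
and field handling here are illustrative; the real bookkeeping lives in
redo's sqlite database):

```python
import hashlib

def redo_stamp(target, stdin_data, state, runid):
    # Checksum whatever the .do script piped to our stdin.
    t = state.setdefault(target, {"stamp": None,
                                  "checked_runid": 0, "changed_runid": 0})
    stamp = hashlib.sha256(stdin_data).hexdigest()
    t["checked_runid"] = runid           # we verified the target this run
    if stamp != t["stamp"]:
        t["stamp"] = stamp
        t["changed_runid"] = runid       # ...and it actually changed
    # else: leave changed_runid alone, so future redo-ifchange callers
    # see the target as unmodified.

state = {}
redo_stamp("lib.list", b"a.o b.o", state, runid=1)
redo_stamp("lib.list", b"a.o b.o", state, runid=2)   # identical content
t = state["lib.list"]
print(t["changed_runid"], t["checked_runid"])  # 1 2
```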
This is slightly inelegant, as the old style

    echo foo
    echo blah
    chmod a+x $3

doesn't work anymore; the stuff you wrote to stdout won't end up in $3.
You can rewrite it as:

    exec >$3
    echo foo
    echo blah
    chmod a+x $3
Anyway, it's better this way, because now we can tell the difference between
a zero-length $3 and a nonexistent one. A .do script can thus produce
either one and we'll either delete the target or move the empty $3 to
replace it, whichever is right.
As a bonus, this simplifies our detection of whether you did something weird
with overlapping changes to stdout and $3.
Although we were deadlock-free before, under some circumstances we'd end up
holding a perfectly good token while in sync wait; that would reduce our
parallelism for no good reason. So give back our tokens before waiting for
anybody else.
That way the user can modify an auto-generated 'compile' script, for
example, and it'll stay modified.
If they delete the file, we can then generate it for them again.
Also, we have to warn whenever we're doing this, or people might think it's
a bug.
It's really a separate condition. And since we're not removing the target
*file* in case of error - we update it atomically, and keeping it is better
than losing it - there's no reason to wipe the timestamp in that case
either.
However, we do need to know that the build failed, so that anybody else
(especially in a parallel build) who looks at that target knows that it
died. So add a separate flag just for that.
It creates a race condition: GNU Make might try to read while the socket is
O_NONBLOCK, get EAGAIN, and die; or else another redo might set it back to
blocking in between our call to make it O_NONBLOCK and our call to read().
This method - setting an alarm() during the read - is hacky, but should work
every time. Unfortunately you get a 1s delay - rarely - when this happens.
The good news is it only happens when there are no tokens available anyhow,
so it won't affect performance much in any situation I can imagine.
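A sketch of the alarm()-during-read trick (simplified; the real code is the
jobserver token handling, and its error handling differs):

```python
import os, signal, time

def read_token(fd, timeout=1):
    # Interrupt a blocking read() after 'timeout' seconds, instead of
    # toggling O_NONBLOCK on a fd shared with GNU make (which races).
    def on_alarm(sig, frame):
        raise TimeoutError
    old = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(timeout)
    try:
        return os.read(fd, 1)
    except TimeoutError:
        return b""            # no token available right now
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)

r, w = os.pipe()              # empty pipe: a plain read() would block
t0 = time.time()
print(read_token(r))          # b'' after about a second
assert time.time() - t0 >= 1
```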
This should reduce filesystem grinding a bit, and makes the code simpler.
It's also theoretically a bit more portable, since I'm guessing fifo
semantics aren't the same on win32 if we ever get there.
Also, a major problem with the old fifo-based system is that if a redo
process died without cleaning up after itself, it wouldn't delete its
lockfiles, so we had to wipe them all at the beginning of each build. Now
we don't; in theory, you can now have multiple copies of redo poking at the
same tree at the same time without stepping on each other.
When you have lots of unmodified dependencies, building these printout
strings (which aren't even printed unless you're using -d) ends up taking
something like 5% of the runtime.
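The fix is the usual lazy-logging pattern, sketched here (names are
illustrative):

```python
DEBUG = False   # i.e. running without -d

def debug(fmt, *args):
    # Skip the '%' formatting entirely when debugging is off; with many
    # unmodified deps, that formatting was the ~5% being wasted.
    if not DEBUG:
        return None
    msg = fmt % args
    print(msg)
    return msg

# Hot loop: with DEBUG off this builds no strings at all.
for name in ["a.o", "b.o"]:
    debug("checked %s: unmodified", name)
```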
It wasn't allowing us to short circuit a dependency if that dependency had
been built previously, but that was already being checked (more correctly)
in dirty_deps().
Just commit when we're about to do something blocking. sqlite goes a lot
faster with bigger transactions. This change does show a small percentage
speedup in tests, but not as much as I'd like.
In flush-cache.sh, we have to do this, because the sqlite3 command-line tool
sets it to zero. Inevitably during parallel testing, it'll end up
contending for a lock, and we really want it to wait a bit.
In state.py, it's not as important since the default is nonzero. But
python-sqlite3's default of 5 seconds makes me a little too nervous; I can
imagine a disk write waiting for more than 5 seconds sometime. So let's use
60 instead.
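With python-sqlite3 that's just the timeout parameter on connect() (a
sketch; the actual call in state.py may pass other options too):

```python
import sqlite3

# python-sqlite3 defaults to a 5-second busy timeout; a disk write
# could plausibly stall longer under a parallel build, so use 60.
db = sqlite3.connect(":memory:", timeout=60)
db.execute("create table t (x)")
db.execute("insert into t values (1)")
print(db.execute("select x from t").fetchone())  # (1,)
```

The sqlite3 command-line tool's equivalent is its .timeout setting (in
milliseconds).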
It passes all tests when run serialized, but still gives weird errors
(OperationalError: database is locked) when run with -j5. sqlite3 shouldn't
be barfing just because the database is locked, since the default timeout is
5 seconds, and it's dying *way* faster than that.
This allows files to transition from generated to not-generated if the .do
file is ever removed (ie. the user is changing things and the file is now a
source file, not a target).