Suggested by djb in personal email, and on the mailing list. redo-targets
lists all the targets in the database; redo-sources lists all the existing
sources (ie. files that are referred to but which aren't targets).
redo-ifcreate filenames aren't included in the redo-sources list.
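The split can be sketched as a set difference (the helper name and argument shapes here are hypothetical; the real data comes out of the sqlite database in state.py):

```python
def split_targets_sources(targets, referenced, ifcreate=()):
    # redo-targets: everything the database knows was built by redo.
    # redo-sources: files referred to as dependencies that aren't targets,
    # excluding redo-ifcreate names (those are expected *not* to exist).
    targets = set(targets)
    sources = set(referenced) - targets - set(ifcreate)
    return sorted(targets), sorted(sources)
```

For example, a dependency on a nonexistent `config.h` recorded via redo-ifcreate would not show up in the sources list.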
The previous method, using fcntl byterange locks, was very efficient and
avoided unnecessary filesystem metadata churn (ie. creating/deleting
inodes). Unfortunately, MacOS X (at least version 10.6.5) apparently has a
race condition in its fcntl locking that makes it unusably unreliable
(http://apenwarr.ca/log/?m=201012#13).
My tests indicate that if you only ever lock a *single* byterange on a file,
the race condition doesn't cause a problem. So let's just use one lockfile
per target. Now "redo -j20 test" passes for me on both MacOS and Linux.
This doesn't measurably affect the speed on Linux, at least, in my tests.
The bad news: it's hard to safely *delete* those lockfiles when we're done
with them, so they tend to accumulate in the .redo dir.
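A minimal sketch of the one-lockfile-per-target scheme (lock_target is a hypothetical helper; the real locking code lives in state.py):

```python
import fcntl, os, tempfile

def lock_target(lockdir, target):
    # One lockfile per target, and we only ever lock a *single* byterange
    # (byte 0) on it, which sidesteps the MacOS fcntl race seen when
    # locking multiple ranges on one shared file. The downside: these
    # files accumulate in the .redo dir, since deleting one safely while
    # another process may be about to lock it is hard.
    path = os.path.join(lockdir, target.replace('/', '_'))
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0)  # lock exactly one byte
    return fd

lockdir = tempfile.mkdtemp()
fd = lock_target(lockdir, 't/all')   # exclusive lock held from here
fcntl.lockf(fd, fcntl.LOCK_UN)       # released when the build is done
os.close(fd)
```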
This could happen if you ran 'redo foo foo'. Nobody ever did that, I
think, but let's make sure we catch it if they do.
One problem with having multiple locks on the same file is that you then have to
remember not to *unlock* it until they're all done. But there are other
problems, such as: why the heck did we think it was a good idea to lock the
same file more than once? So just prevent it from happening for now,
unless/until we somehow come up with a reason it might be a good idea.
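The guard can be as simple as an order-preserving dedup of the target list before any locks are taken (a sketch; dedup_targets is a hypothetical name):

```python
def dedup_targets(targets):
    # Prevent 'redo foo foo' from locking the same target twice (and then
    # having to track when it's safe to unlock): keep only the first
    # occurrence of each target, preserving order.
    seen = set()
    out = []
    for t in targets:
        if t in seen:
            continue  # duplicate: would double-lock/double-unlock
        seen.add(t)
        out.append(t)
    return out
```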
We can't just delete all the dependencies at the beginning and re-add them:
other people might be checking the same dependencies in parallel. Instead,
mark them as delete_me up front, and then after the build completes, remove
only the delete_me entries.
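Sketched with an in-memory dict standing in for the sqlite deps table (the flag values and helper are hypothetical; the real table is in state.py):

```python
def rebuild_deps(deps, run_build):
    # deps maps dep_name -> flags. Don't delete rows up front: parallel
    # readers may be checking them. Instead, mark everything delete_me,
    # let the build re-add the deps it actually uses, then purge only
    # the rows still marked delete_me.
    for d in deps:
        deps[d] = 'delete_me'
    for d in run_build():          # build records its current deps
        deps[d] = 'ok'
    for d in [k for k, v in deps.items() if v == 'delete_me']:
        del deps[d]
    return deps

deps = {'a.h': 'ok', 'stale.h': 'ok'}
rebuild_deps(deps, lambda: ['a.h', 'b.h'])
```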
In redo-ifchange, this might be a good idea, since you might just want to
set a dependency on it, so we won't say anything from inside builder.py.
But if you're calling redo.py, that means you expect it to be rebuilt, since
there's no other reason to try. So print a warning.
(This is what make does, more or less.)
If a checksummed target A used to exist but is now missing, and we tried to
redo-ifchange that exact file, we would unnecessarily run 'redo-oob A A';
that is, we have to build A in order to determine if A needs to be built.
The sub-targets of redo-oob aren't run with REDO_UNLOCKED, so this would
deadlock instantly.
Add an assertion to redo-oob to ensure we never try to redo-ifchange the
primary target (thus converting the deadlock into an exception). And skip
doing redo-oob when the target is already the same as the thing we have to
check.
A new redo-stamp program takes whatever you give it as stdin and uses it to
calculate a checksum for the current target. If that checksum is the same
as last time, then we consider the target to be unchanged, and we set
checked_runid and stamp, but leave changed_runid alone. That will make
future callers of redo-ifchange see this target as unmodified.
However, this is only "half" support because by the time we run the .do
script that calls redo-stamp, it's too late; the caller is a dependent of
the stamped program, which is already being rebuilt, even if redo-stamp
turns out to say that this target is unchanged.
The other half is coming up.
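The core decision can be sketched like this, with a dict standing in for the target's database record (apply_stamp is a hypothetical helper; the field names follow the description above):

```python
import hashlib

def apply_stamp(stdin_data, record, runid):
    # Checksum whatever the .do script piped into redo-stamp.
    new = hashlib.sha1(stdin_data).hexdigest()
    record['checked_runid'] = runid   # we definitely checked it this run
    if new == record.get('stamp'):
        # Unchanged: leave changed_runid alone, so future callers of
        # redo-ifchange see this target as unmodified.
        return False
    record['stamp'] = new
    record['changed_runid'] = runid
    return True
```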
That way the user can modify an auto-generated 'compile' script, for
example, and it'll stay modified.
If they delete the file, we can then generate it for them again.
Also, we have to warn whenever we're doing this, or people might think it's
a bug.
It's really a separate condition. And since we're not removing the target
*file* in case of error (we update it atomically, and keeping it is better
than losing it), there's no reason to wipe the timestamp in that case
either.
However, we do need to know that the build failed, so that anybody else
(especially in a parallel build) who looks at that target knows that it
died. So add a separate flag just for that.
This should reduce filesystem grinding a bit, and makes the code simpler.
It's also theoretically a bit more portable, since I'm guessing fifo
semantics aren't the same on win32 if we ever get there.
Also, a major problem with the old fifo-based system is that if a redo
process died without cleaning up after itself, it wouldn't delete its
lockfiles, so we had to wipe them all at the beginning of each build. Now
we don't; in theory, you can now have multiple copies of redo poking at the
same tree at the same time and not stepping on each other.
It wasn't allowing us to short-circuit a dependency if that dependency had
been built previously, but that was already being checked (more correctly)
in dirty_deps().
Just commit when we're about to do something blocking. sqlite goes a lot
faster with bigger transactions. This change does show a small percentage
speedup in tests, but not as much as I'd like.
In flush-cache.sh, we have to do this, because the sqlite3 command-line tool
sets it to zero. Inevitably during parallel testing, it'll end up
contending for a lock, and we really want it to wait a bit.
In state.py, it's not as important since the default is nonzero. But
python-sqlite3's default of 5 seconds makes me a little too nervous; I can
imagine a disk write waiting for more than 5 seconds sometime. So let's use
60 instead.
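In python-sqlite3 terms, the state.py half of this amounts to overriding the default busy timeout when connecting (a sketch; the real connection setup is in state.py):

```python
import sqlite3

def connect(dbpath):
    # python-sqlite3's default timeout is 5 seconds; a disk write during
    # a parallel build could plausibly block longer than that, so wait up
    # to 60 seconds before giving up with "database is locked".
    return sqlite3.connect(dbpath, timeout=60)

db = connect(':memory:')
db.execute('create table t (x)')
db.execute('insert into t values (1)')
db.commit()   # commit lazily, just before blocking: sqlite goes a lot
              # faster with bigger transactions
```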
It passes all tests when run serialized, but still gives weird errors
(OperationalError: database is locked) when run with -j5. sqlite3 shouldn't
be barfing just because the database is locked, since the default timeout is
5 seconds, and it's dying *way* faster than that.
This allows files to transition from generated to not-generated if the .do
file is ever removed (ie. the user is changing things and the file is now a
source file, not a target).
The interaction of REDO_STARTDIR, REDO_PWD, and getcwd() is pretty
complicated. In this case, we accidentally assumed that the current
instance of redo was running with getcwd() == REDO_STARTDIR+REDO_PWD, and so
the new target was REDO_STARTDIR+REDO_PWD+t, but this isn't the case if the
current .do script did chdir().
The correct answer is REDO_STARTDIR+getcwd()+t.
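Roughly, in code (db_name is a hypothetical helper; here the REDO_STARTDIR-relative part is computed with relpath):

```python
import os, tempfile

def db_name(startdir, t):
    # The target's name relative to REDO_STARTDIR must be based on the
    # *actual* cwd, not REDO_PWD, since the current .do script may have
    # chdir()ed away from REDO_STARTDIR+REDO_PWD.
    rel = os.path.relpath(os.getcwd(), startdir)
    return os.path.normpath(os.path.join(rel, t))

startdir = os.path.realpath(tempfile.mkdtemp())
os.mkdir(os.path.join(startdir, 'sub'))
os.chdir(os.path.join(startdir, 'sub'))   # as if the .do script chdir()ed
```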
If a and b both depend on c, and c is a static (non-generated) file that has
changed since the last successful build of a and b, we would try to redo
a, but would forget to redo b. Now it does both.
We never chdir() except just as we exec a subprocess, so it's okay to cache
this value. This makes strace output look cleaner, and speeds things up a
little bit when checking a large number of dependencies.
Relatedly, take a debug2() message and put it in an additional if, so that
we don't have to do so much work to calculate it when we're just going to
throw it away anyhow.
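The cache is just a module-level memo (a sketch of the idea, not the exact code):

```python
import os

_cwd = None

def getcwd():
    # We never chdir() except right before exec()ing a subprocess, so the
    # value can't change out from under us; caching it saves a syscall
    # per dependency check and keeps strace output cleaner.
    global _cwd
    if _cwd is None:
        _cwd = os.getcwd()
    return _cwd
```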
...because we deliberately stamp non-generated files as well, and that
doesn't need to imply that we rebuilt them just now. In fact, we know for a
fact that we *didn't* rebuild them just now, but we still need to record the
timestamp for later.
If 'redo clean' deletes the lockfile after trylock() succeeds but before
unlock(), then unlock() won't be able to open the pipe in order to release
readers, and any waiters might end up waiting forever.
We can't open the fifo for write until there's at least one reader, so let's
open a reader *just* to let us open a writer. Then we'll leave them open
until the later unlock(), which can just close them both.
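The open-a-reader-first trick, sketched (open_fifo_for_unlock is a hypothetical name):

```python
import os, tempfile

def open_fifo_for_unlock(path):
    # Opening a fifo O_WRONLY blocks until a reader exists, so first open
    # a nonblocking reader of our own just so the writer open succeeds.
    # Keep both open until unlock(): closing the writer then releases
    # anyone blocked reading the fifo, even if the fifo file itself has
    # since been deleted (eg. by 'redo clean').
    rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    wfd = os.open(path, os.O_WRONLY)
    return rfd, wfd

path = os.path.join(tempfile.mkdtemp(), 'lock')
os.mkfifo(path)
rfd, wfd = open_fifo_for_unlock(path)
os.close(wfd)   # unlock(): wakes any waiting readers
os.close(rfd)
```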
If someone else built and marked one of our dependencies, then that
dependency would show up as *clean* in a later redo-ifchange, so other
dependents of that file wouldn't be rebuilt.
We actually have to track two session-specific variables: whether the file
has been checked, and whether it was rebuilt. (Or alternatively, whether it
was dirty when we checked it the first time. But we store the former.)
If a depends on b depends on c, then when we consider building a, we have
to check b and c. If we then are asked about a2 which depends on b, there
is no reason to re-check b and its dependencies; we already know it's done.
This takes the time to do 'redo t/curse/all' the *second* time down from
1.0s to 0.13s. (make can still do it in 0.07s.)
'redo t/curse/all' the first time is down from 5.4s to 4.6s. With -j4,
from 3.0s to 2.5s.
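The memoization can be sketched like this (is_dirty is a hypothetical stand-in for the real check in builder.py):

```python
def is_dirty(t, deps, dirty_leaves, checked):
    # 'checked' memoizes per-session results: once b and c are known to
    # be clean, asking about a2 (which also depends on b) is just a dict
    # lookup instead of another walk of b's dependencies.
    if t in checked:
        return checked[t]
    result = (t in dirty_leaves) or any(
        is_dirty(d, deps, dirty_leaves, checked) for d in deps.get(t, ()))
    checked[t] = result
    return result

deps = {'a': ['b'], 'a2': ['b'], 'b': ['c']}
checked = {}
assert not is_dirty('a', deps, set(), checked)
assert 'b' in checked                              # checked via a
assert not is_dirty('a2', deps, set(), checked)    # pure cache hit now
```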
Now t/curse passes again when parallelized (except for the countall
mismatch, since we haven't fixed the source of that problem yet). At least
it's consistent now.
There's a bunch of stuff rearranged in here, but the actual important
problem was that we were doing unlink() on the lock fifo even if ENXIO,
which meant a reader could connect in between ENXIO and unlink(), and thus
never get notified of the disconnection. This would cause the build to
randomly freeze.