This includes a fairly detailed test of various known shell bugs from the
autoconf docs.
The idea here is that if redo works on your system, you should be able to
rely on a *good* shell to run your .do files; you shouldn't have to work
around zillions of bugs like autoconf does.
Previously, we would only search for default*.do in the same directory as
the target; now we search parent directories as well.
Let's say we're in a/b/ and trying to build foo.o. If we find
../../default.o.do, then we'll run
    cd ../..; sh default.o.do a/b/foo .o $TMPNAME
In other words, we still always chdir to the same directory as the .do file.
But now $1 might have a path in it, not just a basename.
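
The parent-directory search above can be sketched roughly like this. It's a
minimal sketch, not redo's actual code: find_default_do() and its return
tuple are hypothetical, and it only looks for default<ext>.do (not a
target-specific .do file or a plain default.do).

```python
import os


def find_default_do(target_dir, name, exists=os.path.exists):
    """Search target_dir, then each parent, for default<ext>.do.

    Returns (do_dir, do_file, arg1, arg2) or None.  arg1 is the target
    without its extension, relative to do_dir, so it may contain a path;
    arg2 is the extension.  Hypothetical helper, for illustration only.
    """
    base, ext = os.path.splitext(name)
    do_file = "default%s.do" % ext
    start = os.path.abspath(target_dir)
    do_dir = start
    while True:
        if exists(os.path.join(do_dir, do_file)):
            rel = os.path.relpath(start, do_dir)
            arg1 = base if rel == "." else os.path.join(rel, base)
            return do_dir, do_file, arg1, ext
        parent = os.path.dirname(do_dir)
        if parent == do_dir:  # reached the filesystem root
            return None
        do_dir = parent
```

With default.o.do two levels above a/b/, building foo.o gives a do_dir two
levels up, arg1 = a/b/foo and arg2 = .o, matching the command line above.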
The previous method, using fcntl byterange locks, was very efficient and
avoided unnecessary filesystem metadata churn (ie. creating/deleting
inodes). Unfortunately, MacOS X (at least version 10.6.5) apparently has a
race condition in its fcntl locking that makes it unusably unreliable
(http://apenwarr.ca/log/?m=201012#13).
My tests indicate that if you only ever lock a *single* byterange on a file,
the race condition doesn't cause a problem. So let's just use one lockfile
per target. Now "redo -j20 test" passes for me on both MacOS and Linux.
This doesn't measurably affect the speed on Linux, at least, in my tests.
The bad news: it's hard to safely *delete* those lockfiles when we're done
with them, so they tend to accumulate in the .redo dir.
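
A minimal sketch of the one-lockfile-per-target idea, assuming POSIX fcntl
locks via Python's fcntl.lockf(). The TargetLock class, the "lock.N" file
naming and the numeric target id are all made up for illustration.

```python
import fcntl
import os


class TargetLock:
    """One lockfile per target; only the whole file is ever locked.

    Hypothetical helper: the names and on-disk layout are assumptions.
    """

    def __init__(self, lockdir, target_id):
        self.path = os.path.join(lockdir, "lock.%d" % target_id)
        self.fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o666)
        self.owned = False

    def trylock(self):
        try:
            # len=0, start=0: a single range covering the entire file
            fcntl.lockf(self.fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 0, 0)
        except OSError:
            return False  # another process holds the lock
        self.owned = True
        return True

    def unlock(self):
        fcntl.lockf(self.fd, fcntl.LOCK_UN, 0, 0)
        self.owned = False
        # The lockfile is left in place on purpose: deleting it safely
        # while other processes may be opening it is the hard part.

    def close(self):
        os.close(self.fd)
```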
This comes down to the lack of a 'seq' command (what?!) and the fact that
BSD "wc -l" returns extra whitespace, while the GNU version doesn't. We
should be using numeric comparisons instead of string comparisons, and then
it's ok.
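
To illustrate why the comparison type matters (in sh terms, -eq rather than
=), here is the same idea in Python. The padded string is only an example of
what whitespace-padded "wc -l" output might look like; same_count() is a
hypothetical helper.

```python
def same_count(wc_output, expected):
    """Compare `wc -l` style output to an expected count numerically."""
    return int(wc_output.strip()) == expected


padded = "       3\n"  # example of output with leading whitespace
plain = "3\n"          # example of output without it
```

As strings, the two outputs differ; as numbers, both equal 3.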
...only when running under minimal/do, of course.
The tests in question mostly fail because they're testing particular
dependency-related behaviour, and minimal/do doesn't support dependencies,
so naturally it doesn't work.
If a checksummed target A used to exist but is now missing, and we tried to
redo-ifchange that exact file, we would unnecessarily run 'redo-oob A A';
that is, we have to build A in order to determine if A needs to be built.
The sub-targets of redo-oob aren't run with REDO_UNLOCKED, so this would
deadlock instantly.
Add an assertion to redo-oob to ensure we never try to redo-ifchange the
primary target (thus converting the deadlock into an exception). And skip
doing redo-oob when the target is already the same as the thing we have to
check.
We were giving up and rebuilding the toplevel object, which did eventually
rebuild our checksummed file, but then the file turned out to be identical
to what it was before, so that nobody *else* who depended on it ended up
getting rebuilt. So the results were indeterminate.
Now we treat it as if its dirtiness is unknown, so we build it using
redo-oob before building any of its dependencies.
If a depends on b depends on c, and c is dirty but b uses redo-stamp
checksums, then 'redo-ifchange a' is indeterminate: we won't know if we need
to run a.do unless we first build b, but the script that *normally* runs
'redo-ifchange b' is a.do, and we don't want to run that yet, because we
don't know for sure if b is dirty, and we shouldn't build a unless one of
its dependencies is dirty. Eek!
Luckily, there's a safe solution. If we *know* a is dirty - eg. because
a.do or one of its children has definitely changed - then we can just run
a.do immediately and there's no problem, even if b is indeterminate, because
we were going to run a.do anyhow.
If a's dependencies are *not* definitely dirty, and all we have is
indeterminate ones like b, then that means a's build process *hasn't
changed*, which means its tree of dependencies still includes b, which means
we can deduce that if we *did* run a.do, it would end up running b.do.
Since we know that anyhow, we can safely just run b.do, which will call
either b.set_checked() or b.set_changed(). Once that's done, we can re-parse a's
dependencies and this time conclusively tell if it needs to be redone or
not. Even if it does, b is already up-to-date, so the 'redo-ifchange b'
line in a.do will be fast.
...now take all the above and do it recursively to handle nested
dependencies, etc, and you're done.
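
Here's a toy model of the reasoning above, to make the three possible
answers (clean, dirty, or "build these first") concrete. It's only a sketch
under heavy assumptions: the Target class, its fields, and the build() and
ensure() helpers are invented for illustration, and "same_output" stands in
for whether redo-stamp would see an unchanged checksum.

```python
CLEAN, DIRTY = "clean", "dirty"


class Target:
    """Toy model of a target; every field here is an assumption."""

    def __init__(self, name, deps=(), changed=False, stamped=False,
                 same_output=True):
        self.name = name
        self.deps = list(deps)
        self.changed = changed          # known dirty, or changed when built
        self.stamped = stamped          # its .do calls redo-stamp
        self.same_output = same_output  # toy: rebuild reproduces the checksum
        self.checked = False            # already resolved during this run


def is_dirty(t):
    """Return CLEAN, DIRTY, or a list of targets to build before we can tell."""
    if t.checked:
        return DIRTY if t.changed else CLEAN
    if t.changed:
        return DIRTY
    must_build = []
    for dep in t.deps:
        sub = is_dirty(dep)
        if sub == CLEAN:
            continue
        if not t.stamped:
            if sub == DIRTY:
                return DIRTY    # definitely dirty: just run t.do
            must_build += sub   # indeterminate: resolve these first
        else:
            return [t]          # t's checksum might come out unchanged
    return must_build or CLEAN


def build(t, log):
    """Toy stand-in for running t.do."""
    for dep in t.deps:
        ensure(dep, log)        # the 'redo-ifchange dep' lines in t.do
    log.append(t.name)
    t.checked = True
    t.changed = not (t.stamped and t.same_output)


def ensure(t, log):
    """Toy stand-in for 'redo-ifchange t'."""
    while not t.checked:
        state = is_dirty(t)
        if state == CLEAN:
            t.checked = True
        elif state == DIRTY:
            build(t, log)
        else:
            for sub in state:   # build the indeterminate targets, then retry
                build(sub, log)
```

With a depending on b depending on c, c dirty and b stamped, ensure(a)
builds c and b first; a.do only runs if b's output actually changed.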
We were rebuilding the checksummed file every time because redo-ifchange was
incorrectly assuming that a child's changed_runid that's greater than my
changed_runid means I'm dirty. But if my checked_runid is >= the child's
checked_runid, then I'm clean, because my checksum didn't change.
Clear as mud?
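
Maybe a sketch helps. This is just the rule as stated above written out as a
predicate, not the actual redo-ifchange code; the State tuple and the
function name are made up.

```python
from collections import namedtuple

# Hypothetical per-target record holding just the two run ids discussed above.
State = namedtuple("State", "changed_runid checked_runid")


def child_dirties_parent(parent, child):
    """Return True if `child` makes `parent` dirty, per the rule above."""
    if parent.checked_runid >= child.checked_runid:
        # Parent was verified at least as recently as the child, and its
        # checksum didn't change, so it's clean.
        return False
    return child.changed_runid > parent.changed_runid
```

For example, a checksummed parent with State(3, 5) and a child with
State(5, 5) is clean under this rule, while the old comparison of
changed_runid alone would have called it dirty.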
A new redo-stamp program takes whatever you give it as stdin and uses it to
calculate a checksum for the current target. If that checksum is the same
as last time, then we consider the target to be unchanged, and we set
checked_runid and stamp, but leave changed_runid alone. That will make
future callers of redo-ifchange see this target as unmodified.
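
Roughly, the bookkeeping looks like the sketch below. It assumes a plain
dict as the per-target record and SHA-1 as the checksum; apply_stamp() and
the field layout are hypothetical, only the field names checked_runid,
changed_runid and stamp come from the description above.

```python
import hashlib


def apply_stamp(record, data, runid, new_stamp):
    """Update a target record from redo-stamp style input.

    `record` is a dict with keys csum, stamp, changed_runid, checked_runid
    (an assumed layout).  Returns True if the checksum is unchanged.
    """
    csum = hashlib.sha1(data).hexdigest()  # SHA-1 is an assumption
    unchanged = record.get("csum") == csum
    record["csum"] = csum
    record["checked_runid"] = runid
    record["stamp"] = new_stamp
    if not unchanged:
        # Only a real content change bumps changed_runid.
        record["changed_runid"] = runid
    return unchanged
```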
However, this is only "half" support because by the time we run the .do
script that calls redo-stamp, it's too late; the caller is a dependant of
the stamped target and is already being rebuilt, even if redo-stamp
turns out to say that this target is unchanged.
The other half is coming up.
This is slightly inelegant, as the old style
    echo foo
    echo blah
    chmod a+x $3
doesn't work anymore; the stuff you wrote to stdout didn't end up in $3.
You can rewrite it as:
    exec >$3
    echo foo
    echo blah
    chmod a+x $3
Anyway, it's better this way, because now we can tell the difference between
a zero-length $3 and a nonexistent one. A .do script can thus produce
either one and we'll either delete the target or move the empty $3 to
replace it, whichever is right.
As a bonus, this simplifies our detection of whether you did something weird
with overlapping changes to stdout and $3.
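
The zero-length vs. nonexistent distinction could be handled along these
lines once the .do script has finished successfully. A minimal sketch:
finish_target() and its return strings are hypothetical, and tmp3 stands for
the temp file passed to the script as $3.

```python
import os


def finish_target(target, tmp3):
    """Install or delete `target` based on what the .do script left in $3."""
    if os.path.exists(tmp3):
        # $3 exists, possibly zero-length: move it into place atomically.
        os.replace(tmp3, target)
        return "replaced"
    if os.path.exists(target):
        # The script produced no $3 at all: the target should not exist.
        os.unlink(target)
        return "deleted"
    return "absent"
```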
That way the user can modify an auto-generated 'compile' script, for
example, and it'll stay modified.
If they delete the file, we can then generate it for them again.
Also, we have to warn whenever we're doing this, or people might think it's
a bug.
It's really a separate condition. And since we're not removing the target
*file* in case of error - we update it atomically, and keeping it is better
than losing it - there's no reason to wipe the timestamp in that case
either.
However, we do need to know that the build failed, so that anybody else
(especially in a parallel build) who looks at that target knows that it
died. So add a separate flag just for that.
In flush-cache.sh, we have to do this, because the sqlite3 command-line tool
sets it to zero. Inevitably during parallel testing, it'll end up
contending for a lock, and we really want it to wait a bit.
In state.py, it's not as important since the default is nonzero. But
python-sqlite3's default of 5 seconds makes me a little too nervous; I can
imagine a disk write waiting for more than 5 seconds sometime. So let's use
60 instead.
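
On the python side, the busy timeout is just the timeout argument to
sqlite3.connect(), in seconds. A minimal sketch; the connect() wrapper and
the TIMEOUT constant are illustrative names, not necessarily what state.py
uses.

```python
import sqlite3

TIMEOUT = 60  # seconds to wait on a locked database before giving up


def connect(path):
    """Open the database with a generous busy timeout."""
    return sqlite3.connect(path, timeout=TIMEOUT)
```

The sqlite3 command-line shell has a ".timeout MS" dot-command (in
milliseconds) for the same purpose.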
It passes all tests when run serialized, but still gives weird errors
(OperationalError: database is locked) when run with -j5. sqlite3 shouldn't
be barfing just because the database is locked, since the default timeout is
5 seconds, and it's dying *way* faster than that.
If a and b both depend on c, and c is a static (non-generated) file that has
changed since the last successful build of a and b, we would try to redo
a, but would forget to redo b. Now we redo both.
If a file previously was generated but now isn't (ie. its .do file
disappears), we would never re-stamp that target, and so all its
dependants would rebuild continually.
We would build 'somefile' correctly the first time, but we wouldn't
attach the dependency on somefile to the right $TARGET, so our target would
not auto-rebuild in the future based on somefile.
It actually decreases readability of the .do files - by not making it
explicit when you're going into a subdir.
Plus it adds ambiguity: what if there's a dirname.do *and* a dirname/all?
We could resolve the ambiguity if we wanted, but that adds more code, while
taking out this special case makes *less* code and improves readability.
I think it's the right way to go.
Normally, creating the target $1 yourself is bad; create $3 instead. But if
$1 is a directory, we'll allow it. That way 'redo subdir' can call
subdir.do, and subdir.do can both create the directory *and* run a bunch of
sub-.do files on it.
We had a bug (fixed in the previous commit) where doing 'redo-ifchange
dirname' (which runs dirname/all.do) would not create the stamp correctly,
so that it would always show up as dirty.
It's a little bit complicated to simulate, but this does it.
Unfortunately it failed before the previous patch, so that's why this test
is needed :(
The test is a little ugly, because the bug I'm testing for only happened
if you ran 'redo' twice in a row, not twice inside the same redo session.
That's because dependency caching inside the one session prevents the
accidental rebuild.
.do files should never modify $1, and should write to *either* $3 or stdout,
but not both. If they write to both, it's probably because they forgot to
redirect stdout to stderr, a very easy mistake to make but a hard one to
detect.
Now redo detects it for you and prints an informative message.
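
A check of this kind might look like the sketch below, run after the .do
script exits. It's an assumption-laden illustration: snapshot(),
check_outputs() and the message wording are made up, stdout_tmp is the file
the script's stdout was captured into, and tmp3 is the file passed as $3.

```python
import os


def snapshot(path):
    """Cheap fingerprint of a file, or None if it doesn't exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_mtime_ns, st.st_size, st.st_ino)


def check_outputs(target, before, stdout_tmp, tmp3):
    """Return an error message if the .do script misbehaved, else None.

    `before` is snapshot(target) taken just before the script ran.
    """
    if snapshot(target) != before:
        return "script modified $1 directly; write to $3 or stdout instead"
    wrote_stdout = (os.path.exists(stdout_tmp)
                    and os.path.getsize(stdout_tmp) > 0)
    if wrote_stdout and os.path.exists(tmp3):
        return "script wrote to stdout *and* created $3; use only one"
    return None
```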
Now 'redo test' runs the tests, but 'redo t' just builds the programs.
Also removed wvtest stuff; we're not really using it properly anyway and
it's not helping our testing right now. It might come back later.
    redo:                      5.4s
    redo -j4:                  3.0s
    make:                      2.3s
    make -j4:                  1.4s
    make SHELL=/bin/dash:      1.2s
    make SHELL=/bin/dash -j4:  0.83s
We have some distance to go yet. Of course, redo is still written in
python, not C, so it's very expensive, and the on-disk dependency store is
very inefficient.
Now t/curse passes again when parallelized (except for the countall
mismatch, since we haven't fixed the source of that problem yet). At least
it's consistent now.
There's a bunch of stuff rearranged in here, but the actual important
problem was that we were doing unlink() on the lock fifo even if ENXIO,
which meant a reader could connect in between ENXIO and unlink(), and thus
never get notified of the disconnection. This would cause the build to
randomly freeze.
...because it seems my locking isn't very good. It exposes annoying
problems involving rebuilding the same files more than once, screwing up
stamp files with redo -j, and being unnecessarily slow when checking
dependencies. So it's a pretty good test considering how simple it is.
Didn't add it to t/all.do yet, because it would fail.