jobserver: don't release the very last token in wait_all().

After waiting for children to exit, we would release our own token, and then the caller would immediately try to obtain a token again. This accounted for tokens correctly, but would pass tokens around the call tree in unexpected ways.

For example, imagine we had only one token. We call 'redo a1 a2', and a1 calls 'redo b1 b2', and b1 calls 'redo c1'. When c1 exits, it releases its token, then tries to re-acquire it before exiting. This also includes 'redo b1 b2' and 'redo a1 a2' in the race for the token, which means b1 might get suspended while *either* a2 or b2 starts running.

This never caused a deadlock, even if a2 or b2 depends on b1, because if they tried to build b1, they would notice it is locked, give up their token, and wait for the lock. c1 (and then b1) could then obtain the token and immediately terminate, allowing progress to continue.

But this is not really the way we expect things to happen. "Obviously" what we want here is a straightforward stack unwinding: c1 should finish, then b1, then b2, then a1, then a2.

The not-very-obvious symptom of this bug is that redo's unit tests seemed to run in the wrong order when using -j1 --no-log. (--log would hide the problem by rearranging logs back into the right order!)
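The difference in token accounting can be sketched with a toy model. This is only an illustration under stated assumptions, not redo's actual jobserver code: the class `ToyJobserver`, its `pool` counter, and the method names `wait_all_old` and `wait_all_new` are hypothetical. Only the idea of a per-process `_mytokens` count, and the old and new values it may have after wait_all(), come from the commit itself.

```python
class ToyJobserver:
    """Hypothetical toy model of per-process token accounting.

    Not redo's real jobserver: 'pool' stands in for the shared token
    pipe, and 'mytokens' for the process-local token count.
    """

    def __init__(self, mytokens):
        self.mytokens = mytokens  # tokens this process currently holds
        self.pool = 0             # tokens available to any other process

    def wait_all_old(self):
        # Old behaviour: release every token, including our own.
        # The caller must then compete with other processes to get one back.
        self.pool += self.mytokens
        self.mytokens = 0

    def wait_all_new(self):
        # New behaviour: release only the extras and keep the last token,
        # so nobody else can start running before we finish.
        if self.mytokens > 1:
            self.pool += self.mytokens - 1
            self.mytokens = 1


old = ToyJobserver(mytokens=1)
old.wait_all_old()
print(old.mytokens, old.pool)  # token is up for grabs

new = ToyJobserver(mytokens=1)
new.wait_all_new()
print(new.mytokens, new.pool)  # token stays with us
```

In the old variant the single -j1 token lands in the shared pool, which is what lets a2 or b2 win the race against b1. In the new variant the count after waiting is 0 or 1, matching the relaxed assertion `_mytokens <= 1` in the diff.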
2018-12-31 16:53:13 -05:00
commit e247a72300
parent 22dd0cdd6b
2 changed files with 20 additions and 14 deletions
--- a/redo/builder.py
+++ b/redo/builder.py
@@ -550,7 +550,7 @@ def run(targets, shouldbuildfunc):
    while locked or jobserver.running():
        state.commit()
        jobserver.wait_all()
-        assert jobserver._mytokens == 0  # pylint: disable=protected-access
+        assert jobserver._mytokens <= 1  # pylint: disable=protected-access
        jobserver.ensure_token_or_cheat('self', cheat)
        # at this point, we don't have any children holding any tokens, so
        # it's okay to block below.