Workaround for completely broken file locking on Windows 10 WSL.

WSL (Windows Subsystem for Linux) provides a Linux-kernel-compatible ABI for userspace processes, but the current version doesn't implement fcntl() locks at all; it just always returns success. See https://github.com/Microsoft/WSL/issues/1927.

This causes us three kinds of problems:
  1. sqlite3 in WAL mode gives "OperationalError: locking protocol".
     1b. Other sqlite3 journal modes also don't work when used by multiple processes.
  2. redo parallelism doesn't work, because we can't prevent the same target from being built several times simultaneously.
  3. "redo-log -f" doesn't work, since it can't tell whether the log file it's tailing is "done" or not.

To fix #1, we switch the sqlite3 journal back to PERSIST instead of WAL. We originally changed to WAL in commit 5156feae9d to reduce deadlocks on MacOS. That was never adequately explained, but PERSIST still acts weird on MacOS, so we'll only switch to PERSIST when we detect that locking is definitely broken. Sigh.

To (mostly) fix #2, we disable any -j value > 1 when locking is broken. This prevents basic forms of parallelism, but doesn't stop you from re-entrantly starting other instances of redo. To fix that properly, we need to switch to a different locking mechanism entirely, which is tough in Python. flock() locks probably work, for example, but Python's locks lie and just use fcntl locks for those.

To fix #3, we always force --no-log mode when we find that locking is broken.
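
As a rough illustration of the three fallbacks described above, here is a minimal Python 3 sketch. It is not the code from this commit: the function names (open_state_db, effective_jobs, logging_enabled), the locks_broken parameter, and the 60-second timeout are all hypothetical, chosen only to show the shape of the decisions.

```python
import sqlite3


def open_state_db(dbfile, locks_broken, timeout=60):
    """Open the state database, picking a journal mode based on lock health.

    Hypothetical helper: 'locks_broken' stands in for whatever flag the
    caller uses to record that fcntl() locks cannot be trusted.
    """
    db = sqlite3.connect(dbfile, timeout=timeout)
    # WAL depends on working locks; PERSIST is the fallback when they are broken.
    jmode = "PERSIST" if locks_broken else "WAL"
    # The pragma returns one row containing the journal mode now in effect.
    mode = db.execute("pragma journal_mode = %s" % jmode).fetchone()[0]
    return db, mode


def effective_jobs(requested_jobs, locks_broken):
    """Cap parallelism at one job when locks cannot prevent duplicate builds."""
    return 1 if locks_broken else requested_jobs


def logging_enabled(requested, locks_broken):
    """Disable logging when a log tailer could not tell if a log is finished."""
    return bool(requested) and not locks_broken
```

Checking the mode that the pragma reports, rather than assuming the request succeeded, is one way to notice when sqlite3 silently keeps a different journal mode.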
2019-01-02 14:18:51 -05:00
commit 61f3e4672e
parent 613fcb1c34
18 changed files with 122 additions and 34 deletions
--- a/redo/state.py
+++ b/redo/state.py
@@ -17,7 +17,12 @@ LOG_LOCK_MAGIC = 0x10000000  # fid offset for "log locks"
 def _connect(dbfile):
    _db = sqlite3.connect(dbfile, timeout=TIMEOUT)
    _db.execute("pragma synchronous = off")
-    _db.execute("pragma journal_mode = WAL")
+    # Some old/broken versions of pysqlite on MacOS work badly with journal
+    # mode PERSIST.  But WAL fails on Windows WSL due to WSL's totally broken
+    # locking.  On WSL, at least PERSIST works in single-threaded mode, so
+    # if we're careful we can use it, more or less.
+    jmode = 'PERSIST' if env.v.LOCKS_BROKEN else 'WAL'
+    _db.execute("pragma journal_mode = %s" % (jmode,))
    _db.text_factory = str
    return _db

@@ -50,6 +55,8 @@ def db():
    _lockfile = os.open(os.path.join(env.v.BASE, '.redo/locks'),
                        os.O_RDWR | os.O_CREAT, 0666)
    close_on_exec(_lockfile, True)
+    if env.is_toplevel and detect_broken_locks():
+        env.mark_locks_broken()

    must_create = not os.path.exists(dbfile)
    if not must_create:
@@ -110,6 +117,8 @@ def db():
 def init(targets):
    env.init(targets)
    db()
+    if env.is_toplevel and detect_broken_locks():
+        env.mark_locks_broken()


 _wrote = 0
@@ -530,3 +539,46 @@ class Lock(object):
                            % self.fid)
        fcntl.lockf(_lockfile, fcntl.LOCK_UN, 1, self.fid)
        self.owned = False
+
+
+def detect_broken_locks():
+    """Detect Windows WSL's completely broken fcntl() locks.
+
+    Symptom: locking a file always returns success, even if other processes
+    also think they have it locked. See
+    https://github.com/Microsoft/WSL/issues/1927 for more details.
+
+    Bug exists at least in WSL "4.4.0-17134-Microsoft #471-Microsoft".
+
+    Returns true if broken, false otherwise.
+    """
+    pl = Lock(0)
+    # We wait for the lock here, just in case others are doing
+    # this test at the same time.
+    pl.waitlock(shared=False)
+    pid = os.fork()
+    if pid:
+        # parent
+        _, rv = os.waitpid(pid, 0)
+        ok = os.WIFEXITED(rv) and not os.WEXITSTATUS(rv)
+        return not ok
+    else:
+        # child
+        try:
+            # Doesn't actually unlock, since child process doesn't own it
+            pl.unlock()
+            del pl
+            cl = Lock(0)
+            # parent is holding lock, which should prevent us from getting it.
+            owned = cl.trylock()
+            if owned:
+                # Got the lock? Yikes, the locking system is broken!
+                os._exit(1)
+            else:
+                # Failed to get the lock? Good, the parent owns it.
+                os._exit(0)
+        except Exception:  # pylint: disable=broad-except
+            import traceback
+            traceback.print_exc()
+        finally:
+            os._exit(99)
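
The detection technique in detect_broken_locks() can also be tried outside of redo. The following is a standalone Python 3 sketch under stated assumptions, not the project's implementation: it calls fcntl.lockf() directly instead of going through the Lock class, and the function name locks_look_broken and the scratch file path are hypothetical.

```python
import fcntl
import os
import tempfile


def locks_look_broken(path):
    """Return True if a child process can take a lock the parent already holds."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        # Parent takes an exclusive lock on byte 0 and holds it while the child runs.
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0)
        pid = os.fork()
        if pid == 0:
            # Child: fcntl locks belong to a process, so the inherited fd does
            # not carry the parent's lock. A non-blocking attempt should fail.
            status = 99  # reported only if something unexpected happens
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0)
                status = 1  # got the lock: locking is broken
            except OSError:
                status = 0  # refused, as expected on a working system
            finally:
                os._exit(status)
        # Parent: a clean exit status of 0 means the child was refused.
        _, rv = os.waitpid(pid, 0)
        ok = os.WIFEXITED(rv) and os.WEXITSTATUS(rv) == 0
        return not ok
    finally:
        os.close(fd)  # closing the descriptor also releases the parent's lock


if __name__ == "__main__":
    scratch = os.path.join(tempfile.mkdtemp(), "locktest")
    print("locks broken:", locks_look_broken(scratch))
```

As in the commit, any abnormal child exit is treated as "broken", which errs on the side of disabling parallelism rather than risking duplicate builds.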