LSEQ Base64: A Continued Fraction System for Distributed Identifiers

This is a continued fraction system.

You have identifiers like [1, 2, 52] that represent positions in a mixed-radix number system where each digit position has exponentially more capacity: 64, 64², 64³, etc. When you need to insert between two identifiers, you're doing arithmetic in this variable-base system until you find a digit position with enough space.
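
To make the capacities concrete, here is a small Python sketch. It assumes depths are counted from 0 and that depth d holds 64^(d+1) slots, as described above; the helper name `slots_at_depth` is illustrative and not taken from the implementation.

```python
BASE = 64  # base taken from the text; depth 0 is the first digit


def slots_at_depth(depth):
    """Number of slots at a 0-indexed depth: 64, 64**2, 64**3, ..."""
    return BASE ** (depth + 1)


for depth in range(4):
    print(depth, slots_at_depth(depth))
# 0 64
# 1 4096
# 2 262144
# 3 16777216
```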

The Core Problem

You want to insert a new identifier between p = [1,2,52] and q = [1,2,53]. These are adjacent at the last digit, so there's no room. What do you do?

Answer: Extend to the next digit position (depth 3), which has 64⁴ ≈ 16.7 million slots.

This is exactly like having the decimal numbers 1252 and 1253, realizing they're adjacent, and extending to 12520 through 12529 to find space between them. Except our "decimal" system has bases 64, 64², 64³, etc.

The Algorithm

Walk down both identifiers digit by digit, building the result as you go:

function alloc(self, p, q):
    depth = 0
    max_depth = max(p.len(), q.len()) + 1
    result = Vec::with_capacity(max_depth)
    borrow_flag = false

    while depth < max_depth:
        // Lazily pick a persistent strategy the first time a depth is visited
        if self.strategies.len() <= depth:
            self.strategies.push(random(bool))

        p_val = p[depth] if depth < p.len() else 0
        q_val =
            if borrow_flag:
                max_value_at_depth  // largest digit at this depth
            else if depth < q.len():
                q[depth]
            else:
                0

        if p_val == q_val:
            result.push(p_val)
            depth += 1  // Same value, continue deeper
            continue

        if q_val - p_val > 1:
            // Enough space at this level
            interval = q_val - p_val - 1

            if self.strategies[depth]:
                // boundary+: add to p
                result.push(p_val + random(1, min(BOUNDARY, interval)))
            else:
                // boundary-: subtract from q
                result.push(q_val - random(1, min(BOUNDARY, interval)))
            break
        else:
            // q_val - p_val == 1, not enough space, go deeper one level
            result.push(p_val)
            depth += 1
            borrow_flag = true

    return result
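
The walk above can be turned into a runnable Python sketch. This is an illustration under stated assumptions, not the project's actual code: the class name `Lseq`, the value `BOUNDARY = 10`, and the definition of `max_value_at_depth` as 64^(depth+1) - 1 are all assumptions. Normalization of empty inputs is left out here.

```python
import random

BASE = 64      # assumption: depth d holds BASE**(d+1) slots, numbered from 0
BOUNDARY = 10  # assumption: the text names BOUNDARY but gives no value


def max_value_at_depth(depth):
    """Largest digit at a 0-indexed depth under the assumption above."""
    return BASE ** (depth + 1) - 1


class Lseq:
    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.strategies = []  # one persistent bool per depth; True = boundary+

    def alloc(self, p, q):
        """Return an identifier strictly between p and q (requires p < q)."""
        if not p < q:
            raise ValueError("alloc requires p < q")
        max_depth = max(len(p), len(q)) + 1
        result = []
        borrow = False
        for depth in range(max_depth):
            # Pick a strategy the first time this depth is visited.
            if len(self.strategies) <= depth:
                self.strategies.append(self.rng.random() < 0.5)
            p_val = p[depth] if depth < len(p) else 0
            if borrow:
                q_val = max_value_at_depth(depth)
            elif depth < len(q):
                q_val = q[depth]
            else:
                q_val = 0
            gap = q_val - p_val
            if gap > 1:
                # Enough space: step away from p (boundary+) or from q (boundary-).
                step = self.rng.randint(1, min(BOUNDARY, gap - 1))
                digit = p_val + step if self.strategies[depth] else q_val - step
                result.append(digit)
                return result
            # Equal or adjacent digits: keep p's digit and go one level deeper.
            result.append(p_val)
            if gap == 1:
                borrow = True
        raise ValueError("no free slot found between p and q")


lseq = Lseq(seed=42)
new_id = lseq.alloc([1, 2, 52], [1, 2, 53])
print(new_id)  # [1, 2, 52, k]; k depends on the random choices
```

Python compares lists lexicographically, so `[1, 2, 52] < new_id < [1, 2, 53]` is a quick way to check the result.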

The key insights:

  • Pre-allocate result: We know the maximum possible depth upfront
  • Borrow flag: When there's no space at a depth, we set a borrow flag that affects how we interpret missing digits in the next level
  • Strategy array: Each depth has a persistent strategy (boundary+ or boundary-) to prevent clustering
  • Boundary limiting: Use a BOUNDARY constant to limit random selection and improve distribution

Why This Works

Guaranteed space: Each level has exponentially more capacity (64^(level+1) slots), so you'll always find space eventually.

Total ordering: The lexicographic ordering of vectors gives you a consistent sort order.

No coordination: Two nodes can independently pick identifiers without talking to each other.
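
The total-ordering property can be checked directly in Python, whose built-in list comparison is lexicographic. The identifiers below are made up for illustration; `[1, 2, 52, 7]` stands in for a value allocated between `[1, 2, 52]` and `[1, 2, 53]`.

```python
# Hypothetical identifiers in arbitrary arrival order.
ids = [[1, 2, 53], [1, 2, 52, 7], [1, 3], [1, 2, 52]]

# sorted() compares element by element; a strict prefix sorts before
# any longer list that extends it.
ordered = sorted(ids)
print(ordered)
# [[1, 2, 52], [1, 2, 52, 7], [1, 2, 53], [1, 3]]
```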

Concrete Example

p = [1,2,52], q = [1,2,53]

  • Depth 0: p_val = 1, q_val = 1 → same value, so result[0] = 1, continue deeper
  • Depth 1: p_val = 2, q_val = 2 → same value, so result[1] = 2, continue deeper
  • Depth 2: p_val = 52, q_val = 53 → q_val - p_val = 1, no space (≤ 1), so result[2] = 52, set borrow_flag = true, continue deeper
  • Depth 3: p_val = 0 (past end of p), q_val = max_value_at_depth (because borrow_flag is true) → huge interval available!

Now we have space: interval = max_value_at_depth - 0 - 1. Check the strategy at depth 3:

  • If strategies[3] = true (boundary+): result[3] = 0 + random(1, min(BOUNDARY, interval))
  • If strategies[3] = false (boundary-): result[3] = max_value_at_depth - random(1, min(BOUNDARY, interval))

Return [1,2,52,chosen_value].
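
The two possible ranges for chosen_value can be worked out with a short sketch. It assumes `BOUNDARY = 10` (the text gives no value) and that max_value_at_depth at depth 3 is 64⁴ - 1; the function name `candidate_range` is illustrative.

```python
BASE = 64
BOUNDARY = 10  # assumed value for illustration


def candidate_range(p_val, q_val, boundary_plus):
    """Inclusive (low, high) range the new digit can fall in."""
    interval = q_val - p_val - 1
    limit = min(BOUNDARY, interval)
    if boundary_plus:
        return (p_val + 1, p_val + limit)   # p_val + random(1, limit)
    return (q_val - limit, q_val - 1)       # q_val - random(1, limit)


max_at_depth_3 = BASE ** 4 - 1  # 16777215 under the assumption above
print(candidate_range(0, max_at_depth_3, True))   # (1, 10)
print(candidate_range(0, max_at_depth_3, False))  # (16777205, 16777214)
```

With boundary+ the new digit stays close to p, and with boundary- it stays close to q; in both cases it lies strictly between the two.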

The "Borrowing" (Borrow Flag)

When there's no space at depth 2 (q_val - p_val = 1), we set borrow_flag = true. This affects how we interpret missing digits in the next level:

  • Without borrow flag: q's digit is used as-is, and a missing digit in q becomes 0
  • With borrow flag: q's digit becomes max_value_at_depth, whether or not q has a digit at that depth

Why? Because when we couldn't fit at depth 2, we're now looking for space between:

  • [1,2,52,0...] (p extended)
  • [1,2,52,max_value...] (q "borrowed down")

Continued fraction borrowing direction: Since our array represents continued fraction numerators from most significant to least significant, we're borrowing from the more significant position (earlier in the array at depth 2) to create space at the less significant position (later in the array at depth 3).

This is like decimal borrowing, but in reverse array order: when looking between 1252 and 1253, we actually search between 1252 and 1252.999... The borrow flag tells us we're in this "borrowed" state where the more significant digit has lent capacity to the less significant position.
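
The effect of the borrow flag on q's digit can be isolated in a few lines. This is a sketch following the pseudocode; the helper name `q_digit` and the definition of the maximum digit as 64^(depth+1) - 1 are assumptions.

```python
BASE = 64


def q_digit(q, depth, borrow_flag):
    """Digit used for q at `depth`, following the borrow rule."""
    if borrow_flag:
        # Borrowed state: the upper bound is the largest digit at this depth,
        # regardless of what q actually holds here.
        return BASE ** (depth + 1) - 1
    return q[depth] if depth < len(q) else 0


print(q_digit([1, 2, 53], 3, borrow_flag=False))  # 0
print(q_digit([1, 2, 53], 3, borrow_flag=True))   # 16777215
print(q_digit([2, 3], 1, borrow_flag=True))       # 4095
```

The last call shows why the flag overrides an existing digit: between `[1, 5]` and `[2, 3]`, anything of the form `[1, x]` with x > 5 is valid, so q's own second digit (3) is irrelevant once the first digits are adjacent.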

Edge Cases

Adjacent values: Handled by extending to the next depth.

Maximum values: If p[depth] is already at max, extending still works because the next depth has way more capacity.

Empty inputs: p = [] becomes [0], q = [] becomes [max_at_depth_0].
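
The empty-input rule can be sketched as a normalization step applied before the main walk. The function name `normalize` is hypothetical, and max_at_depth_0 is assumed to be 63 (64 slots numbered from 0).

```python
BASE = 64


def normalize(p, q):
    """Replace empty bounds with the smallest and largest depth-0 identifiers."""
    lower = p if p else [0]
    upper = q if q else [BASE - 1]  # assumed max_at_depth_0
    return lower, upper


print(normalize([], []))         # ([0], [63])
print(normalize([1, 2], []))     # ([1, 2], [63])
```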

Why "Continued Fraction"?

Each digit position has a different base (64¹, 64², 64³, ...), and you're doing arithmetic across these variable-capacity positions. Variable capacity per position is the defining characteristic of a mixed-radix system, and that is the sense in which "continued fraction" is used in this document.

The tree visualization is just a way to think about it, but fundamentally you're doing arithmetic in a number system where carrying/borrowing happens between positions with different capacities.

Implementation Details

The actual code builds the result vector directly as it traverses both identifiers simultaneously. Key implementation points:

  • Pre-allocated result: We know the maximum depth upfront: max(p.len(), q.len()) + 1
  • Strategy persistence: Each depth has a persistent random strategy (boundary+ or boundary-) stored in self.strategies
  • Borrow flag mechanics: When q_val - p_val = 1, we conceptually subtract one from q_val (which is the same as taking p_val) and set borrow_flag for the next level
  • Boundary limiting: Use min(BOUNDARY, interval) to limit random selection and improve distribution

The strategy selection prevents clustering and ensures good distribution of identifiers over time.
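
Strategy persistence can be sketched as a lazily filled list: a depth gets a random strategy the first time it is visited and keeps it afterwards. The class name `Strategies` and its `at` method are hypothetical; the source only says the values live in self.strategies.

```python
import random


class Strategies:
    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._by_depth = []  # True = boundary+, False = boundary-

    def at(self, depth):
        """Return the strategy for `depth`, choosing it once on first use."""
        while len(self._by_depth) <= depth:
            self._by_depth.append(self._rng.random() < 0.5)
        return self._by_depth[depth]


strategies = Strategies(seed=7)
first = [strategies.at(d) for d in range(4)]
again = [strategies.at(d) for d in range(4)]
print(first == again)  # True
```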