import { Math } from "~/components/math/math"

# How Salience Works

A couple of days ago I came across [github.com/mattneary/salience](https://github.com/mattneary/salience) by Matt Neary. I thought it was quite neat how he took sentence embeddings and in just a few lines of code was able to determine the significance of all sentences in a document.

This post is an outsider's view of how salience works. If you're already working with ML models in Python, this will feel torturously detailed. I wrote this for the rest of us old-world programmers: the compilers, networking, and systems crowd working in C++/Go/Rust, or the poor souls in the frontend TypeScript mines. For us refugees of the barbarian past, the tooling and notation can look foreign. I wanted to walk through the math and numpy operations in detail to show what's actually happening with the data.

Salience highlights important sentences by treating your document as a graph where sentences that talk about similar things are connected. We then figure out which sentences are most "central" to the document's themes.

## Step 1: Break Text into Sentences

The first problem we need to solve is finding the sentences in a document. This is not as easy as splitting on newlines or periods. Consider this example:

*"Dr. Smith earned his Ph.D. in 1995."* ← This is **one** sentence, not three!

Fortunately, this problem has been adequately solved for decades. We are going to use the **Punkt sentence splitter** (2003), available in the Natural Language Toolkit (NLTK) Python package. It handles the tricky cases where naive punctuation splitting fails.

## Step 2: Apply an Embedding Model

Now we have <Math tex="N" /> sentences. We convert each one into a high-dimensional vector that captures its meaning:

<Math display tex="\mathbf{E} = \text{model.encode}(\text{sentences}) \in \mathbb{R}^{N \times D}" />

For example:

<Math tex="\mathbf{Sentence \space A} = [a_1, a_2, a_3, \ldots, a_D]" display="block" />
<Math tex="\mathbf{Sentence \space B} = [b_1, b_2, b_3, \ldots, b_D]" display="block" />
<Math tex="\mathbf{Sentence \space C} = [c_1, c_2, c_3, \ldots, c_D]" display="block" />

## Step 3: Build the Adjacency Matrix

Now we create a new <Math tex="N \times N" /> adjacency matrix <Math tex="\mathbf{A}" /> that measures how similar each pair of sentences is. For every pair of sentences <Math tex="i" /> and <Math tex="j" />, we need the **cosine similarity**:

<Math display tex="A_{ij} = \frac{\mathbf{e}_i \cdot \mathbf{e}_j}{\|\mathbf{e}_i\| \|\mathbf{e}_j\|}" />

Each <Math tex="A_{ij}" /> represents how strongly sentence <Math tex="i" /> is connected to sentence <Math tex="j" />:

- <Math tex="A_{ij} = 1" /> means the sentences are identical in meaning
- <Math tex="A_{ij} = 0" /> means the sentences are unrelated
- <Math tex="A_{ij} = -1" /> means the sentences are opposite in meaning

You could work with these embedding vectors one at a time, using two for loops to build the adjacency matrix leetcode-style. However, there's a way to delegate the computation to optimized libraries. Instead, organize all the embeddings into a single matrix:

<Math display tex="\mathbf{E} = \begin{bmatrix} a_1 & a_2 & a_3 & \cdots & a_D \\ b_1 & b_2 & b_3 & \cdots & b_D \\ c_1 & c_2 & c_3 & \cdots & c_D \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ z_1 & z_2 & z_3 & \cdots & z_D \end{bmatrix}" />

Where:

- <Math tex="D" /> = embedding dimension (768 for all-mpnet-base-v2, 1024 for gte-large-en-v1.5)
- Each row represents one sentence in semantic space

**Step 3a: Compute all dot products**

<Math display tex="\mathbf{S} = \mathbf{E} \mathbf{E}^T" />
Since <Math tex="\mathbf{E}" /> is <Math tex="N \times D" /> and <Math tex="\mathbf{E}^T" /> is <Math tex="D \times N" />, their product gives us an <Math tex="N \times N" /> matrix where entry <Math tex="S_{ij} = \mathbf{e}_i \cdot \mathbf{e}_j" />.
**Step 3b: Compute the norms and normalize**
|
||||
|
||||
First, compute a vector of norms:
|
||||
|
||||
<Math display tex="\mathbf{n} = \begin{bmatrix} \|\mathbf{e}_1\| \\ \|\mathbf{e}_2\| \\ \|\mathbf{e}_3\| \\ \vdots \\ \|\mathbf{e}_N\| \end{bmatrix}" />
This is an <Math tex="(N, 1)" /> vector where each element is the magnitude of one sentence's embedding. Now we need to visit every single element of <Math tex="\mathbf{S}" /> to make the adjacency matrix <Math tex="A_{ij} = \frac{S_{ij}}{n_i \cdot n_j}" />:
<Math display tex="\mathbf{A} = \begin{bmatrix} \frac{S_{11}}{n_1 \cdot n_1} & \frac{S_{12}}{n_1 \cdot n_2} & \cdots & \frac{S_{1N}}{n_1 \cdot n_N} \\ \frac{S_{21}}{n_2 \cdot n_1} & \frac{S_{22}}{n_2 \cdot n_2} & \cdots & \frac{S_{2N}}{n_2 \cdot n_N} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{S_{N1}}{n_N \cdot n_1} & \frac{S_{N2}}{n_N \cdot n_2} & \cdots & \frac{S_{NN}}{n_N \cdot n_N} \end{bmatrix}" />
**Quick benchmark:** For a <Math tex="194 \times 768" /> embeddings matrix (194 sentences):
- Computing everything with Python for loops: **33.1 ms**
- Using <Math tex="\mathbf{E} \mathbf{E}^T" /> for the dot products, but element-by-element normalization in Python: **10.9 ms** (saves 22.2 ms)
- Using numpy **broadcasting** for the normalization too: **0.13 ms**
Broadcasting is a numpy feature where dividing arrays of different shapes automatically "stretches" the smaller array to match:
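As a toy illustration (shapes chosen arbitrarily): dividing a <Math tex="(3, 3)" /> matrix by a <Math tex="(3, 1)" /> column divides row-wise, and dividing by its <Math tex="(1, 3)" /> transpose divides column-wise:

```python
import numpy as np

S = np.ones((3, 3))
n = np.array([[1.0], [2.0], [4.0]])  # shape (3, 1): one "norm" per row

by_rows = S / n    # row i is divided by n[i]
by_cols = S / n.T  # n.T has shape (1, 3); column j is divided by n[j]

print(by_rows[1])     # [0.5 0.5 0.5]
print(by_cols[:, 2])  # [0.25 0.25 0.25]
```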
```python
import numpy as np

def cos_sim(a):
    sims = a @ a.T                                     # (N, N) dot products
    norms = np.linalg.norm(a, axis=-1, keepdims=True)  # (N, 1) vector of norms
    sims /= norms    # divides each row i by norm[i]
    sims /= norms.T  # divides each column j by norm[j]
    return sims
```
The `keepdims=True` makes `norms` shape <Math tex="(N, 1)" /> instead of <Math tex="(N,)" />, which is crucial—when transposed, <Math tex="(N, 1)" /> becomes <Math tex="(1, N)" />, allowing the broadcasting to work for column-wise division.
The result is an <Math tex="N \times N" /> **adjacency matrix** where <Math tex="A_{ij}" /> represents how strongly sentence <Math tex="i" /> is connected to sentence <Math tex="j" />.
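To convince yourself the broadcasting version really computes the definition, here is a quick self-contained check (restating `cos_sim` so the snippet runs on its own) against a direct two-for-loop transcription of the formula:

```python
import numpy as np

def cos_sim(a):
    sims = a @ a.T
    norms = np.linalg.norm(a, axis=-1, keepdims=True)
    sims /= norms
    sims /= norms.T
    return sims

def cos_sim_loops(a):
    # Direct transcription of A_ij = e_i . e_j / (|e_i| |e_j|),
    # one pair (i, j) at a time.
    n = a.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = a[i] @ a[j] / (np.linalg.norm(a[i]) * np.linalg.norm(a[j]))
    return out

E = np.random.default_rng(0).normal(size=(5, 8))  # stand-in for real embeddings
print(np.allclose(cos_sim(E), cos_sim_loops(E)))  # True
```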
## Step 4: Clean Up the Graph

We make two adjustments to the adjacency matrix to get a cleaner graph:

1. **Remove self-loops:** Set the diagonal to zero (<Math tex="A_{ii} = 0" />)
   - A sentence shouldn't vote for its own importance
2. **Remove negative edges:** Set <Math tex="A_{ij} = \max(0, A_{ij})" />
   - Sentences with opposite meanings get disconnected

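In numpy both adjustments are one-liners. A sketch, assuming `A` is the adjacency matrix from the previous step:

```python
import numpy as np

def clean_graph(A):
    A = A.copy()                  # don't mutate the caller's matrix
    np.fill_diagonal(A, 0.0)      # 1. remove self-loops: A_ii = 0
    return np.clip(A, 0.0, None)  # 2. remove negative edges: max(0, A_ij)

A = np.array([[ 1.0,  0.8, -0.2],
              [ 0.8,  1.0,  0.1],
              [-0.2,  0.1,  1.0]])
print(clean_graph(A))
```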
**Important assumption:** This assumes your document has a coherent main idea and that sentences are generally on-topic. We're betting that the topic with the most "semantic mass" is the *correct* topic. This is obviously not true for many documents:

- Dialectical essays that deliberately contrast opposing viewpoints
- Documents heavy with quotes that argue against something
- Debate transcripts where both sides are equally important
- Critical analysis that spends significant time explaining a position before refuting it

For example: "Nuclear power is dangerous. Critics say it causes meltdowns [...]. However, modern reactors are actually very safe."

The algorithm might highlight the criticism because multiple sentences cluster around "danger", even though the document's actual position is pro-nuclear. There's nothing inherent in the math that identifies authorial intent vs. quoted opposition.