---
title: How Salience Works
---
import { Math } from "~/components/math/math"
# How Salience Works
A couple of days ago I came across
[github.com/mattneary/salience](https://github.com/mattneary/salience) by Matt Neary. I thought it
was quite neat how he took sentence embeddings and in just a few lines of code
was able to determine the significance of all sentences in a document.
This post is an outsider's view of how salience works. If you're already working with ML models in Python, this will feel
torturously detailed. I wrote it for the rest of us old-world programmers: compiler, networking, and systems people working in
C++/Go/Rust, or the poor souls in the frontend TypeScript mines.
For us refugees of the barbarian past, the tooling and notation can look foreign. I wanted to walk through the math and
numpy operations in detail to show what's actually happening with the data.
Salience highlights important sentences by treating your document as a graph where sentences that talk about similar things are connected. We then figure out which sentences are most "central" to the document's themes.
## Step 1: Break Text into Sentences
The first problem we need to solve is finding the sentences in a document. This is not as easy as splitting on newlines or periods. Consider this example:
*"Dr. Smith earned his Ph.D. in 1995."* ← This is **one** sentence, not three!
Fortunately, this problem has been adequately solved for decades. We are going to use the **Punkt sentence splitter** (2003) available in the Natural Language Toolkit (NLTK) Python package.
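Here's a minimal sketch with NLTK (the model download is a one-time step, and the example sentences are mine):
```python
import nltk

nltk.download("punkt")  # pretrained Punkt model; newer NLTK releases name it "punkt_tab"
from nltk.tokenize import sent_tokenize

text = "Dr. Smith earned his Ph.D. in 1995. He now teaches at MIT."
sentences = sent_tokenize(text)
# ['Dr. Smith earned his Ph.D. in 1995.', 'He now teaches at MIT.']
```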
## Step 2: Apply an Embedding Model
Now we have <Math tex="N" /> sentences. We convert each one into a high-dimensional vector that captures its meaning. For example:
<Math tex="\mathbf{Sentence \space A} = [a_1, a_2, a_3, \ldots, a_D]" display="block" />
<Math tex="\mathbf{Sentence \space B} = [b_1, b_2, b_3, \ldots, b_D]" display="block" />
<Math tex="\mathbf{Sentence \space C} = [c_1, c_2, c_3, \ldots, c_D]" display="block" />
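As a concrete sketch, here's what this step looks like with the `sentence-transformers` package; the model is just one of the two mentioned in Step 3 below:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # D = 768
embeddings = model.encode(sentences)              # numpy array of shape (N, 768)
```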
## Step 3: Build the Adjacency Matrix
Now we create a new <Math tex="N \times N" /> adjacency matrix <Math tex="\mathbf{A}" /> that measures how similar each pair of sentences is. For every pair of sentences <Math tex="i" /> and <Math tex="j" />, we need the **cosine similarity**:
<Math display tex="A_{ij} = \frac{\mathbf{e}_i \cdot \mathbf{e}_j}{\|\mathbf{e}_i\| \|\mathbf{e}_j\|}" />
Each <Math tex="A_{ij}" /> represents how strongly sentence <Math tex="i" /> is connected to sentence <Math tex="j" />.
- <Math tex="A_{ij} = 1" /> means sentences are identical in meaning
- <Math tex="A_{ij} = 0" /> means sentences are unrelated
- <Math tex="A_{ij} = -1" /> means sentences are opposite in meaning
You could work with these embedding vectors one at a time, using two for loops to build the adjacency matrix, LeetCode style. However, there's a way to delegate the computation to optimized libraries: organize all embeddings into a single matrix.
<Math display tex="\mathbf{E} = \begin{bmatrix} a_1 & a_2 & a_3 & \cdots & a_D \\ b_1 & b_2 & b_3 & \cdots & b_D \\ c_1 & c_2 & c_3 & \cdots & c_D \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ z_1 & z_2 & z_3 & \cdots & z_D \end{bmatrix}" />
Where:
- <Math tex="N" /> = number of sentences (rows)
- <Math tex="D" /> = embedding dimension (768 for all-mpnet-base-v2, 1024 for gte-large-en-v1.5)
- Each row represents one sentence in semantic space
**Step 3a: Compute all dot products**
<Math display tex="\mathbf{S} = \mathbf{E} \mathbf{E}^T" />
Since <Math tex="\mathbf{E}" /> is <Math tex="N \times D" /> and <Math tex="\mathbf{E}^T" /> is <Math tex="D \times N" />, their product gives us an <Math tex="N \times N" /> matrix where entry <Math tex="S_{ij} = \mathbf{e}_i \cdot \mathbf{e}_j" />.
**Step 3b: Compute the norms and normalize**
First, compute a vector of norms:
<Math display tex="\mathbf{n} = \begin{bmatrix} \|\mathbf{e}_1\| \\ \|\mathbf{e}_2\| \\ \|\mathbf{e}_3\| \\ \vdots \\ \|\mathbf{e}_N\| \end{bmatrix}" />
This is an <Math tex="(N, 1)" /> vector where each element is the magnitude of one sentence's embedding. Now we need to visit every single element of <Math tex="\mathbf{S}" /> to make the adjacency matrix <Math tex="A_{ij} = \frac{S_{ij}}{n_i \cdot n_j}" />:
<Math display tex="\mathbf{A} = \begin{bmatrix} \frac{S_{11}}{n_1 \cdot n_1} & \frac{S_{12}}{n_1 \cdot n_2} & \cdots & \frac{S_{1N}}{n_1 \cdot n_N} \\ \frac{S_{21}}{n_2 \cdot n_1} & \frac{S_{22}}{n_2 \cdot n_2} & \cdots & \frac{S_{2N}}{n_2 \cdot n_N} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{S_{N1}}{n_N \cdot n_1} & \frac{S_{N2}}{n_N \cdot n_2} & \cdots & \frac{S_{NN}}{n_N \cdot n_N} \end{bmatrix}" />
**Quick benchmark:** For a <Math tex="194 \times 768" /> embeddings matrix (194 sentences):
- Computing everything in Python for loops: **33.1 ms**
- Using <Math tex="\mathbf{E} \mathbf{E}^T" /> for dot products, but element-by-element normalization in Python: **10.9 ms** (saves 22.2 ms)
- Using numpy **broadcasting** for normalization too: **0.13 ms**
Broadcasting is a numpy feature where dividing arrays of different shapes automatically "stretches" the smaller array to match:
```python
import numpy as np

def cos_sim(a):
    # a has shape (N, D): one embedding per row
    sims = a @ a.T                                     # (N, N) matrix of dot products
    norms = np.linalg.norm(a, axis=-1, keepdims=True)  # (N, 1) vector of row norms
    sims /= norms    # divides each row i by norms[i]
    sims /= norms.T  # divides each column j by norms[j]
    return sims
```
The `keepdims=True` makes `norms` shape <Math tex="(N, 1)" /> instead of <Math tex="(N,)" />, which is crucial—when transposed, <Math tex="(N, 1)" /> becomes <Math tex="(1, N)" />, allowing the broadcasting to work for column-wise division.
## Step 4: Clean Up the Graph
We make two adjustments to the adjacency matrix to make our TextRank work:
1. **Remove self-loops:** Set diagonal to zero (<Math tex="A_{ii} = 0" />)
2. **Remove negative edges:** Set <Math tex="A_{ij} = \max(0, A_{ij})" />
A sentence shouldn't vote for its own importance. And sentences with opposite meanings get disconnected.
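Continuing the numpy sketch from above, both adjustments are one-liners:
```python
A = cos_sim(embeddings)
np.fill_diagonal(A, 0.0)  # remove self-loops: A_ii = 0
A = np.maximum(A, 0.0)    # remove negative edges: A_ij = max(0, A_ij)
```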
**Important assumption:** This assumes your document has a coherent main idea and that sentences are generally on-topic. We're betting that the topic with the most "semantic mass" is the *correct* topic. This is obviously not true for many documents:
- Dialectical essays that deliberately contrast opposing viewpoints
- Documents heavy with quotes that argue against something
- Debate transcripts where both sides are equally important
- Critical analysis that spends significant time explaining a position before refuting it
For example: "Nuclear power is dangerous. Critics say it causes meltdowns [...]. However, modern reactors are actually very safe."
The algorithm might highlight the criticism because multiple sentences cluster around "danger", even though the document's actual position is pro-nuclear. There's nothing inherent in the math that identifies authorial intent vs. quoted opposition.
**Bottom line:** This technique works well for coherent, single-perspective documents. It can fail when multiple competing viewpoints have similar semantic weight.
## Step 5: Normalize the Adjacency Matrix
The idea from **TextRank** is to treat similarity as a graph problem: simulate random walks and see where you're likely to end up. Sentences you frequently visit are important.
But first, we need to compute the **degree matrix** <Math tex="\mathbf{D}" />. This tells us how "connected" each sentence is:
<Math display tex="\mathbf{D} = \text{diag}(\mathbf{A} \mathbf{1})" />
Here's what this means:
- <Math tex="\mathbf{A} \mathbf{1}" /> means "sum up each row of <Math tex="\mathbf{A}" />"
- For sentence <Math tex="i" />, this gives us <Math tex="d_i = \sum_j A_{ij}" /> (the total similarity to all other sentences)
- <Math tex="\text{diag}(...)" /> puts these sums on the diagonal of a matrix
The result is a diagonal matrix that looks like:
<Math display tex="\mathbf{D} = \begin{bmatrix} d_1 & 0 & 0 & \cdots & 0 \\ 0 & d_2 & 0 & \cdots & 0 \\ 0 & 0 & d_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & d_N \end{bmatrix}" />
**Intuition:** A sentence with high degree (<Math tex="d_i" /> is large) is connected to many other sentences or has strong connections. A sentence with low degree is more isolated.
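In code there's no need to materialize the full diagonal matrix; the row sums suffice. A sketch, continuing from above:
```python
deg = A.sum(axis=1)  # d_i = sum_j A_ij, the degree of each sentence
D = np.diag(deg)     # the full degree matrix, if you actually want it
```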
Now we use <Math tex="\mathbf{D}" /> to normalize <Math tex="\mathbf{A}" />. There are two approaches:
**Traditional normalization** <Math tex="\mathbf{D}^{-1} \mathbf{A}" />:
- This creates a row-stochastic matrix (rows sum to 1)
- Interpretation: "If I'm at sentence <Math tex="i" />, what's the probability of jumping to sentence <Math tex="j" />?"
- This is like a proper Markov chain transition matrix
- Used in standard PageRank and TextRank
**Spectral normalization** <Math tex="\mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}" />:
- Used in spectral clustering and graph analysis
- Symmetry preservation: if <Math tex="\mathbf{A}" /> is symmetric (which a cosine similarity matrix is), the normalized version stays symmetric
- The eigenvalues are bounded in [-1, 1]
- More uniform influence from all neighbors
- Better numerical properties for exponentiation
The traditional <Math tex="\mathbf{D}^{-1} \mathbf{A}" /> approach introduces degree-dependent bias and destroys the symmetry of the matrix. Spectral normalization keeps the adjacency matrix symmetric and gives more uniform influence to neighbors, preventing any one node's degree from dominating the graph's structure.
With traditional normalization, sentences with many connections get their influence diluted. A sentence connected to 10 others splits its "voting power" into 10 pieces. A sentence connected to 2 others splits its power into just 2 pieces. This creates a bias against well-connected sentences.
Spectral normalization treats the graph as **undirected**, which matches how
semantic similarity works. Well-connected sentences keep their influence
proportional to connectivity. Two sentences that are similar to each other
should have equal influence on each other, not asymmetric transition
probabilities.
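Here's a sketch of both normalizations in numpy, assuming every sentence has at least one positive connection so `deg` has no zeros:
```python
# Traditional row-stochastic normalization: D^{-1} A
P = A / deg[:, None]

# Spectral normalization: D^{-1/2} A D^{-1/2}
inv_sqrt = 1.0 / np.sqrt(deg)
A_norm = inv_sqrt[:, None] * A * inv_sqrt[None, :]
```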
## Step 6: Random Walk Simulation
We simulate importance propagation by raising the normalized matrix to a power:
<Math display tex="\mathbf{s} = \mathbf{1}^T \tilde{\mathbf{A}}^k" />
Where:
- <Math tex="\mathbf{1}" /> = vector of ones (start with equal weight on all sentences)
- <Math tex="k" /> = random walk length (default: 5)
- <Math tex="\mathbf{s}" /> = raw salience scores for each sentence
**Intuition:** After <Math tex="k" /> steps of random walking through the similarity graph, which sentences have we visited most? Those are the central, important sentences.
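With the spectrally normalized matrix from the sketch above (`A_norm`, playing the role of <Math tex="\tilde{\mathbf{A}}" />), the whole step is two lines:
```python
k = 5  # random walk length
scores = np.ones(len(A_norm)) @ np.linalg.matrix_power(A_norm, k)  # raw salience per sentence
```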
## Step 7: Map Scores to Highlight Colors
Now we have a vector of raw salience scores from the random walk. Problem: these scores have no physical meaning. Different embedding models produce wildly different ranges:
- Model A on Doc 1: `[0.461, 1.231]`
- Model B on Doc 2: `[0.892, 1.059]`
We need to turn this vector of arbitrary numbers into CSS highlight opacities in `[0, 1]`. Here's the reasoning behind creating the remapping function:
I could do trivial linear scaling: multiply by a constant to get scores into some range like <Math tex="X" /> to <Math tex="X + 2" />. But let's try to make the top sentences stand out more. One trick is exponentiation: it preserves order but pushes the top values apart, and since human perception of brightness is nonlinear, that makes the top few sentences really pop out.
**Building the remapping function**
Given a salience vector <Math tex="\mathbf{s}" /> with values ranging from <Math tex="\min(\mathbf{s})" /> to <Math tex="\max(\mathbf{s})" />:
1. **Find an exponent** <Math tex="p" /> such that <Math tex="\max(\mathbf{s}^p) \approx \min(\mathbf{s}^p) + 2" />
Sure, it takes more work to find the right exponent for our target spread of 2, but that's still easy with a simple solver.
2. **Find a threshold** <Math tex="\tau" /> such that 50% of the sentences get clamped to zero.
Since I'm using this for editing documents, I only want to see highlights on roughly half the sentences—the important half.
The final opacity mapping is:
<Math display tex="\text{opacity}_i = \text{clamp}\left(s_i^p - \tau, 0, 1\right)" />
For each document, I use a simple 1D solver to find <Math tex="p" /> and <Math tex="\tau" /> that satisfy these constraints.
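Here's a minimal sketch of that remapping, with some assumptions of mine: I use `scipy.optimize.brentq` as the 1D solver (the original may solve it differently), and I assume <Math tex="\max(\mathbf{s}) > 1" /> so the bracket below contains a sign change:
```python
from scipy.optimize import brentq

def to_opacity(s, spread=2.0, drop=0.5):
    # 1) find exponent p such that max(s**p) - min(s**p) == spread
    lo, hi = s.min(), s.max()
    p = brentq(lambda p: (hi ** p - lo ** p) - spread, 1e-3, 200.0)
    sp = s ** p
    # 2) threshold tau so that a `drop` fraction of sentences clamp to zero
    tau = np.quantile(sp, drop)
    return np.clip(sp - tau, 0.0, 1.0)

opacity = to_opacity(scores)  # CSS highlight opacity per sentence
```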
**Final thought:** This last step—converting the output from TextRank into highlight colors—is the weakest part of the system. I have no idea if it's actually correct or whether it even allows meaningful comparison between different embedding models. It works well enough for the intended purpose (quickly seeing which sentences to keep when editing), but the numerical values themselves are essentially arbitrary.
---
[← Back to App](/)