How Salience Works

A couple of days ago I came across github.com/mattneary/salience by Matt Neary. I thought it was quite neat how someone well armed with math can take sentence embeddings and determine the significance of all sentences in a document in fewer lines of code than my introduction paragraph here.

Salience highlights important sentences by treating your document as a graph where sentences that talk about similar things are connected. We then figure out which sentences are most "central" to the document's themes.

This is not a description of all the changes I made and extra book-keeping involved to turn Matt's script into a proper web app demo.

Warning! This post is an outsider's view of how Matt's salience code works. If you're already working with ML models in Python, this will feel torturously detailed.

I'm thinking that for someone fluent in this stuff, they see a lot more than just Matt's few dozen lines of code. They'll see the underlying equation, the shape of the matrices, how the data is laid out in memory, and crucially the alternatives. They see this is a graph problem. And then they're flipping through a mental catalog: random walks, spectral decomposition, diffusion, flow-based methods. Asking which one applies, what assumptions each makes, and whether the data is close enough to satisfy them. If not, maybe you aproximate, linearize, symmetrize, or threshold until it does. Knowing the toolkit and knowing when to bend the rules All of this is available to them effortlessly, automatically, without conscious thought.

I remember when I first learned Haskell. Reading the code was slow! I had to think quite a bit about what I was looking at. Then after about month, something clicked. I could suddenly read Haskell like English or C++. The translation became effortless, almost invisible.

I would bet my last donut the same thing can happen to you with numpy and and ML papers. At some point, fluency kicks in. You will read the equations an ML engineer would doodle out, intinctively have a feel for the dataflow, the element by element matrix operations under the hood, while simultaneously seeing in your mind's eye the equivalent high level numpy code.

Today I'm going to show you the equation, the matching source code, and the alternative theorem/algorithms/tricks you could have deployed at each step. I'll explain things that will seem painfully obvious to experts: this is a matrix multiplication—how many rows? how many columns? what's the shape of the output? That level of detail.

I'm essentially narrating the day-jobbers' automatic, subconscious processes. I hope laying out all the alternate forms (showing the choices, the reasons, the mental links between code and math) brings you one step closer to fluency.

Step 1: Break Text into Sentences

The first problem we need to solve is finding the sentences in a document. This is not as easy as splitting on newlines or periods. Consider this example:

"Dr. Smith earned his Ph.D. in 1995." ← This is one sentence, not three!

Fortunately, this problem has been adequately solved for decades. We are going to use the Punkt sentence splitter (2003) available in the Natural Language Toolkit (NLTK) Python package.

Step 2: Apply an Embedding Model

Now we have $N$ sentences. We convert each one into a high-dimensional vector that captures its meaning. For example:

𝐒 𝐞 𝐧 𝐭 𝐞 𝐧 𝐜 𝐞 𝐀 = [a_{1}, a_{2}, a_{3}, \dots, a_{D}]

𝐒 𝐞 𝐧 𝐭 𝐞 𝐧 𝐜 𝐞 𝐁 = [b_{1}, b_{2}, b_{3}, \dots, b_{D}]

𝐒 𝐞 𝐧 𝐭 𝐞 𝐧 𝐜 𝐞 𝐂 = [c_{1}, c_{2}, c_{3}, \dots, c_{D}]

Step 3: Build the Adjacency Matrix

Now we create a new $N \times N$ adjacency matrix $𝐀$ that measures how similar each pair of sentences is. For every pair of sentences $i$ and $j$ , we need the cosine similarity:

A_{i j} = \frac{𝐞_{i} \cdot 𝐞_{j}}{‖ 𝐞_{i} ‖ ‖ 𝐞_{j} ‖}

Each $A_{i j}$ represents how strongly sentence $i$ is connected to sentence $j$ .

$A_{i j} = 1$ means sentences are identical in meaning
$A_{i j} = 0$ means sentences are unrelated
$A_{i j} = - 1$ means sentences are opposite in meaning

You could work with these embedding vectors one at a time, using two for loops to build the adjacency matrix leetcode style. However, there's a way to delegate the computation to optimized libraries. Instead, organize all embeddings into a single matrix:

𝐄 = [\begin{matrix} a_{1} & a_{2} & a_{3} & \dots & a_{D} \\ b_{1} & b_{2} & b_{3} & \dots & b_{D} \\ c_{1} & c_{2} & c_{3} & \dots & c_{D} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ z_{1} & z_{2} & z_{3} & \dots & z_{D} \end{matrix}]

Where:

$N$ = number of sentences (rows)
$D$ = embedding dimension (768 for all-mpnet-base-v2, 1024 for gte-large-en-v1.5)
Each row represents one sentence in semantic space

First, compute all the dot products at once:

𝐒 = 𝐄 𝐄^{T}

Since $𝐄$ is $N \times D$ and $𝐄^{T}$ is $D \times N$ , their product gives us an $N \times N$ matrix where entry $S_{i j} = 𝐞_{i} \cdot 𝐞_{j}$ .

Now we complete the cosine similarity formula by dividing each element by the product of the corresponding embedding norms:

A_{i j} = \frac{S_{i j}}{‖ 𝐞_{i} ‖ \cdot ‖ 𝐞_{j} ‖}

This gives us the full adjacency matrix:

𝐀 = [\begin{matrix} \frac{S_{11}}{‖ 𝐞_{1} ‖ \cdot ‖ 𝐞_{1} ‖} & \frac{S_{12}}{‖ 𝐞_{1} ‖ \cdot ‖ 𝐞_{2} ‖} & \dots & \frac{S_{1 N}}{‖ 𝐞_{1} ‖ \cdot ‖ 𝐞_{N} ‖} \\ \frac{S_{21}}{‖ 𝐞_{2} ‖ \cdot ‖ 𝐞_{1} ‖} & \frac{S_{22}}{‖ 𝐞_{2} ‖ \cdot ‖ 𝐞_{2} ‖} & \dots & \frac{S_{2 N}}{‖ 𝐞_{2} ‖ \cdot ‖ 𝐞_{N} ‖} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \frac{S_{N 1}}{‖ 𝐞_{N} ‖ \cdot ‖ 𝐞_{1} ‖} & \frac{S_{N 2}}{‖ 𝐞_{N} ‖ \cdot ‖ 𝐞_{2} ‖} & \dots & \frac{S_{N N}}{‖ 𝐞_{N} ‖ \cdot ‖ 𝐞_{N} ‖} \end{matrix}]

Quick benchmark: For a $194 \times 768$ embeddings matrix (194 sentences):

Computing everything in Python for loops: 33.1 ms
Using $𝐄 𝐄^{T}$ for dot products, but element-by-element normalization in Python: 10.9 ms (saves 22.2 ms)
Using numpy broadcasting for normalization too: 0.13 ms

Broadcasting is a numpy feature where dividing arrays of different shapes automatically "stretches" the smaller array to match:

def cos_sim(a):
    sims = a @ a.T
    norms = np.linalg.norm(a, axis=-1, keepdims=True)
    sims /= norms      # Divides each row i by norm[i]
    sims /= norms.T    # Divides each column j by norm[j]
    return sims

The keepdims=True makes norms shape $(N, 1)$ instead of $(N,)$ , which is crucial. When transposed, $(N, 1)$ becomes $(1, N)$ , allowing the broadcasting to work for column-wise division. Transpose does not do anything to the shape $(N,)$ . I don't know why transpose works this way, but this seems like a nasty gotcha to look out for.

Step 4: Clean Up the Graph

We make two adjustments to the adjacency matrix to make our TextRank work:

Remove self-loops: Set diagonal to zero ( $A_{i i} = 0$ )
Remove negative edges: Set $A_{i j} = \max (0, A_{i j})$

A sentence shouldn't vote for its own importance. And sentences with opposite meanings get disconnected. I'll grant you 2) seems like a bit of a leap. I'll grant you that, as my understanding is the real reason we zero out negative entries is the normalization algorithm we want to use does not work with negative edges. Thus we worked backwards from the available normlization algorithms to handwave an assumption your document has a coherent main idea and that sentences are generally on-topic. We're betting that the topic with the most "semantic mass" is the correct topic. This is obviously not true for many documents:

Dialectical essays that deliberately contrast opposing viewpoints
Documents heavy with quotes that argue against something
Debate transcripts where both sides are equally important
Critical analysis that spends significant time explaining a position before refuting it

For example: "Nuclear power is dangerous. Critics say it causes meltdowns [...]. However, modern reactors are actually very safe."

The algorithm might highlight the criticism because multiple sentences cluster around "danger", even though the document's actual position is pro-nuclear. There's nothing inherent in the math that identifies authorial intent vs. quoted opposition.

Reflecting on my personal use cases, basically all documents I would run through such a tool to edit for compactness will be single topic persuasive essays. We should X. Its very unlikely I'll be able to indulge my penchant for dialetical essays at work.

Basically just keep in mind that we've made a pretty big foundational assumption that can fail when multiple competing viewpoints have similar semantic weight and the demo gives you no visual indication or warning this has happened.

Step 5: Normalize the Adjacency Matrix

The idea from TextRank is to treat similarity as a graph problem: simulate random walks and see where you're likely to end up. Sentences you frequently visit are important.

But first, we need to compute the degree matrix $𝐃$ . This tells us how "connected" each sentence is:

𝐃 = diag (𝐀 𝟏)

Here's what this means:

$𝐀 𝟏$ means "sum up each row of $𝐀$ "
For sentence $i$ , this gives us $d_{i} = \sum_{j} A_{i j}$ (the total similarity to all other sentences)
$diag (. . .)$ puts these sums on the diagonal of a matrix

The result is a diagonal matrix that looks like:

𝐃 = [\begin{matrix} d_{1} & 0 & 0 & \dots & 0 \\ 0 & d_{2} & 0 & \dots & 0 \\ 0 & 0 & d_{3} & \dots & 0 \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & 0 & \dots & d_{N} \end{matrix}]

Intuition: A sentence with high degree ( $d_{i}$ is large) is connected to many other sentences or has strong connections. A sentence with low degree is more isolated.

Now we use $𝐃$ to normalize $𝐀$ . There are two approaches:

Traditional normalization $\tilde{𝐀} = 𝐃^{- 1} 𝐀$ :

This creates a row-stochastic matrix (rows sum to 1)
Interpretation: "If I'm at sentence $i$ , what's the probability of jumping to sentence $j$ ?"
This is like a proper Markov chain transition matrix
Used in standard PageRank and TextRank
Supports directed graphs, a property useful for modeling web page navigation (page A links to B but B does not link back to A). We don't need this because similarity of sentence A to B is exactly the same value as B to A.

Spectral normalization $\tilde{𝐀} = 𝐃^{- 1 / 2} 𝐀 𝐃^{- 1 / 2}$ :

Treats the graph as unidirected (hey! that's us)
Symmetry preservation: if A is symmetric (which cosine similarity matrix is), then the normalized version stays symmetric
The eigenvalues are bounded in [-1, 1]
More uniform influence from all neighbors
Better numerical properties for exponentiation

With traditional normalization, sentences with many connections get their influence diluted. A sentence connected to 10 others splits its "voting power" into 10 pieces. A sentence connected to 2 others splits its power into just 2 pieces. This creates a bias against well-connected sentences.

Spectral normalization solves this problem. Well-connected sentences keep their influence proportional to connectivity.

I asked a ML engineer to explain the same idea to give you a Rosetta Stone to understand their jargon.

The traditional $𝐃^{- 1} 𝐀$ approach introduces potential node bias and lacks symmetry. Spectral normalization provides a more balanced representation by symmetrizing the adjacency matrix and ensuring more uniform neighbor influence. This method prevents high-degree nodes from dominating the graph's structure, creating a more equitable information propagation mechanism.

Step 6: Random Walk Simulation

We simulate importance propagation by raising the normalized matrix to a power:

𝐬 = 𝟏^{T} {\tilde{𝐀}}^{k}

Where:

$𝟏$ = vector of ones (start with equal weight on all sentences)
$k$ = random walk length (default: 5)
$𝐬$ = raw salience scores for each sentence

Intuition: After $k$ steps of random walking through the similarity graph, which sentences have we visited most? Those are the central, important sentences.

You might think we'd need to exponentiate the matrix—compute ${\tilde{𝐀}}^{k}$ first, then multiply by $𝟏^{T}$ . But there's a trick here. Since $𝟏^{T}$ is just a row vector of all ones (shape $1 \times N$ ), we can evaluate the expression left-to-right:

𝐬 = ((𝟏^{T} \tilde{𝐀}) \tilde{𝐀}) \tilde{𝐀} \dots

Each step is vector times matrix, which produces another vector. So we're doing $k$ iterations of vector-matrix multiplication, where each one is $N^{2}$ operations. Total cost: $k N^{2}$ .

If we were exponentiating the matrix instead, we'd be doing matrix-matrix multiplication ( $N^{3}$ operations per step). Since $k$ is small (say only 5 or 10) it's way more efficient to just evaluate left-to-right and keep passing a vector through. For a document with 200 sentences and $k = 5$ , that's roughly 200,000 operations instead of 8,000,000. A 40× speedup!

The choice of $k$ is important. A small $k$ (5-10 steps) means the random walk doesn't go very far. A sentence's importance is determined by its immediate neighborhood in the similarity graph. A large $k$ (letting it converge, like PageRank does) means influence propagates across the entire document, and you end up with only the sentences most central to the document's single main theme ranking highly.

For editing, we want the local structure. Documents aren't monolithic. Different paragraphs discuss different aspects of the topic. We want to see which sentences matter within their local context, not just identify the 3-5 globally most central sentences. So we use a small $k$ and deliberately stop before convergence.

As a bonus, this not-fully-converged state also happens to be computationally cheaper.

Step 7: Map Scores to Highlight Colors

Now we have a vector of raw salience scores from the random walk. Problem: these scores have no physical meaning. Different embedding models produce wildly different ranges:

Model A on Doc 1: scores range from 0.461 to 1.231
Model B on Doc 1: scores range from 0.892 to 1.059

We need to turn this vector of arbitrary numbers into CSS highlight opacities in [0, 1]. Here's the reasoning behind creating the remapping function:

Since I'm using this for editing documents, it seems reasonable I would only want to see highlights on roughly half the sentences—throw half away. (Of course, the threshold is configurable in the settings panel.)

Here's the idea: if we map scores into a range of size 2 (say, $X$ to $X + 2$ ), then we can threshold at the midpoint ( $X + 1$ ). Sentences scoring $X + 1$ to $X + 2$ get highlighted.

For a typical document, this gives you roughly 50% highlighted. But it's better than just hard-thresholding at exactly the top 50%: if 70% of sentences score above $X + 1$ , maybe your document is just really on-topic and you don't need to cut as much. If only 30% score above $X + 1$ , the document is scattered and only the truly central sentences get highlighted.

I could do trivial linear scaling to get scores into the range $X$ to $X + 2$ . But let's try to make the top sentences stand out more. One trick: exponentiation. Since human perception of brightness is not linear, exponentiation will preserve order but push the top values apart more. It makes the top few sentences really pop out.

Building the remapping function

Given a salience vector $𝐬$ with values ranging from $\min (𝐬)$ to $\max (𝐬)$ :

Find an exponent $p$ such that $\max (𝐬^{p}) \approx \min (𝐬^{p}) + 2$

Sure, it takes more work to find the right exponent for our target spread of 2, but that's still easy with a simple solver.
Set the threshold $τ$ at the midpoint of the remapped range.

This is where we draw the line between highlighted and non-highlighted sentences.

The final opacity mapping is:

{opacity}_{i} = clamp (s_{i}^{p} - τ, 0, 1)

For each document, I use a simple 1D solver to find $p$ and $τ$ that satisfy these constraints.

Final thought: This last step—converting the output from TextRank into highlight colors—is the weakest part of the system. I have no idea if it's actually correct or whether it even allows meaningful comparison between different embedding models. It works well enough for the intended purpose (quickly seeing which sentences to keep when editing), but the numerical values themselves are essentially arbitrary.

← Back to Demo