---
title: How Salience Works
---

import { Math } from "~/components/math/math"

# How Salience Works

Salience highlights important sentences by treating your document as a graph where sentences that talk about similar things are connected. We then figure out which sentences are most "central" to the document's themes.

## Step 1: Break Text into Sentences

We use NLTK's Punkt tokenizer to split text into sentences. This handles tricky cases where simple punctuation splitting fails:

*"Dr. Smith earned his Ph.D. in 1995."* ← This is **one** sentence, not three!

## Step 2: Convert Sentences to Embeddings

Now we have $N$ sentences. We convert each sentence $i$ into a high-dimensional vector $\mathbf{e}_i \in \mathbb{R}^d$ that captures its meaning.

This gives us an **embeddings matrix** $E$ where each row is one sentence:

$$
E = \begin{bmatrix} \mathbf{e}_1^\top \\ \mathbf{e}_2^\top \\ \vdots \\ \mathbf{e}_N^\top \end{bmatrix} \in \mathbb{R}^{N \times d}
$$
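Where:

- $N$ = number of sentences (rows)
- $d$ = embedding dimension (768 for all-mpnet-base-v2, 1024 for gte-large-en-v1.5)
- Each row represents one sentence in semantic space

## Step 3: Build the Adjacency Matrix

Now we create a new matrix $A$ that measures how similar each pair of sentences is. For every pair of sentences $i$ and $j$, with embedding vectors $\mathbf{e}_i$ and $\mathbf{e}_j$, we compute:

$$
A_{ij} = \frac{\mathbf{e}_i \cdot \mathbf{e}_j}{\lVert \mathbf{e}_i \rVert \, \lVert \mathbf{e}_j \rVert}
$$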
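The pairwise computation can be sketched in a few lines. This is a minimal illustration in plain Python (NumPy would be the natural choice in practice); the function names `cosine_similarity` and `adjacency_matrix` and the toy 2-D vectors are made up for the example, and it assumes no embedding is the zero vector.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def adjacency_matrix(embeddings):
    """Pairwise cosine similarities between all rows of `embeddings`."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]

# Toy 2-D "embeddings", made up for illustration (real ones have hundreds of dimensions).
toy_embeddings = [
    [1.0, 0.0],   # sentence 0
    [2.0, 0.0],   # same direction as sentence 0
    [0.0, 3.0],   # orthogonal to sentence 0
    [-1.0, 0.0],  # opposite direction to sentence 0
]
A = adjacency_matrix(toy_embeddings)
```

With these toy vectors, `A[0][1]` is 1 (same direction), `A[0][2]` is 0 (orthogonal), and `A[0][3]` is -1 (opposite direction). Vector length does not matter, only direction.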
$A_{ij}$ is the **cosine similarity** between the two embedding vectors. It tells us:

- $A_{ij} = 1$ means sentences are identical in meaning
- $A_{ij} = 0$ means sentences are unrelated
- $A_{ij} = -1$ means sentences are opposite in meaning

The result is an $N \times N$ **adjacency matrix** where $A_{ij}$ represents how strongly sentence $i$ is connected to sentence $j$.

## Step 4: Clean Up the Graph

We make two adjustments to the adjacency matrix to get a cleaner graph:

1. **Remove self-loops:** Set diagonal to zero ($A_{ii} = 0$)
   - A sentence shouldn't vote for its own importance
2. **Remove negative edges:** Set $A_{ij} = \max(0, A_{ij})$
   - Sentences with opposite meanings get disconnected

**Important assumption:** Dropping negative edges assumes your document has a coherent main idea and that sentences are generally on-topic. We're betting that the topic with the most "semantic mass" is the *correct* topic.

**Where this breaks down:**

- **Dialectical essays** that deliberately contrast opposing viewpoints
- **Documents heavy with quotes** that argue against something
- **Debate transcripts** where both sides are equally important
- **Critical analysis** that spends significant time explaining a position before refuting it

For example: "Nuclear power is dangerous. Critics say it causes meltdowns. However, modern reactors are actually very safe."

The algorithm might highlight the criticism because multiple sentences cluster around "danger", even though the document's actual position is pro-nuclear. There's nothing inherent in the math that identifies authorial intent vs. quoted opposition.

**Bottom line:** This technique works well for coherent, single-perspective documents. It can fail when multiple competing viewpoints have similar semantic weight.

## Step 5: Normalize the Adjacency Matrix

The idea from **TextRank** is to treat similarity as a graph problem: simulate random walks and see where you're likely to end up. Sentences you frequently visit are important.

But first, we need to compute the **degree matrix** $D$. This tells us how "connected" each sentence is:

$$
D = \operatorname{diag}(A \mathbf{1})
$$
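Here's what this means:

- $A \mathbf{1}$ means "sum up each row of $A$"
- For sentence $i$, this gives us $D_{ii} = \sum_j A_{ij}$ (the total similarity to all other sentences)
- $\operatorname{diag}(\cdot)$ puts these sums on the diagonal of a matrix

The result is a diagonal matrix that looks like:

$$
D = \begin{bmatrix} D_{11} & 0 & \cdots & 0 \\ 0 & D_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & D_{NN} \end{bmatrix}
$$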
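The graph cleanup from Step 4 and the row sums that fill the degree matrix can be sketched as follows. This is a small plain-Python illustration, not the actual implementation; `clean_adjacency`, `degrees`, and the 3-sentence matrix `A_raw` are hypothetical names and values chosen for the example.

```python
def clean_adjacency(A):
    """Zero the diagonal (no self-loops) and clip negative similarities to 0."""
    n = len(A)
    return [
        [0.0 if i == j else max(0.0, A[i][j]) for j in range(n)]
        for i in range(n)
    ]

def degrees(A):
    """Row sums of A, i.e. the diagonal entries of the degree matrix D."""
    return [sum(row) for row in A]

# Hypothetical 3-sentence similarity matrix, made up for illustration.
A_raw = [
    [1.0, 0.5, -0.25],
    [0.5, 1.0, 0.25],
    [-0.25, 0.25, 1.0],
]
A_clean = clean_adjacency(A_raw)  # [[0, 0.5, 0], [0.5, 0, 0.25], [0, 0.25, 0]]
D_diag = degrees(A_clean)         # [0.5, 0.75, 0.25]
```

Only the diagonal of the degree matrix is stored here, since every off-diagonal entry is zero.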
**Intuition:** A sentence with high degree ($D_{ii}$ is large) is connected to many other sentences or has strong connections. A sentence with low degree is more isolated.

Now we use $D$ to normalize $A$. There are two approaches:

Traditional normalization $D^{-1} A$:

- This creates a row-stochastic matrix (rows sum to 1)
- Interpretation: "If I'm at sentence $i$, what's the probability of jumping to sentence $j$?"
- This is like a proper Markov chain transition matrix
- Used in standard PageRank and TextRank

Spectral normalization $D^{-1/2} A D^{-1/2}$:

- Used in spectral clustering and graph analysis
- Symmetry preservation: if $A$ is symmetric (which a cosine similarity matrix is), then the normalized version stays symmetric
- The eigenvalues are bounded in $[-1, 1]$
- More uniform influence from all neighbors
- Better numerical properties for exponentiation

The traditional $D^{-1} A$ approach is asymmetric and biased against well-connected sentences. A sentence connected to 10 others splits its "voting power" into 10 pieces, while a sentence connected to 2 others splits its power into just 2 pieces. Sentences with many connections therefore get their influence diluted.

Spectral normalization treats the graph as **undirected**, which matches how semantic similarity works. Each edge weight $A_{ij}$ is divided by $\sqrt{D_{ii} D_{jj}}$, so it is the same in both directions: two sentences that are similar to each other have equal influence on each other, not asymmetric transition probabilities. This gives more uniform neighbor influence and keeps high-degree nodes from dominating the graph's structure, while well-connected sentences keep their influence proportional to connectivity.

## Step 6: Random Walk Simulation

We simulate importance propagation by raising the normalized matrix $\tilde{A}$ to a power:

$$
\mathbf{s} = \tilde{A}^{k} \mathbf{1}
$$
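Normalization and the walk itself can be sketched together. The example below assumes the spectral variant feeds the walk and that every sentence keeps at least one positive edge (so no degree is zero). The names `spectral_normalize` and `random_walk_scores` and the small matrix are hypothetical. Rather than forming the matrix power explicitly, it applies the normalized matrix to the ones vector `k` times, which gives the same result.

```python
import math

def spectral_normalize(A):
    """Return D^{-1/2} A D^{-1/2}, where D holds the row sums of A."""
    deg = [sum(row) for row in A]
    # Assumption: every row has a positive sum, so the square root is invertible.
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def random_walk_scores(A_norm, k=5):
    """Apply A_norm to the all-ones vector k times (equivalent to A_norm^k times ones)."""
    scores = [1.0] * len(A_norm)
    for _ in range(k):
        scores = [sum(a * s for a, s in zip(row, scores)) for row in A_norm]
    return scores

# Hypothetical cleaned adjacency matrix (zero diagonal, no negative edges).
A_clean = [
    [0.0, 0.8, 0.6],
    [0.8, 0.0, 0.1],
    [0.6, 0.1, 0.0],
]
A_norm = spectral_normalize(A_clean)
scores = random_walk_scores(A_norm, k=5)  # one raw salience score per sentence
```

Repeated matrix-vector products cost far less than building the full matrix power when there are many sentences.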
In the random-walk formula:

- $\mathbf{1}$ = vector of ones (start with equal weight on all sentences)
- $k$ = random walk length (default: 5)
- $\mathbf{s}$ = raw salience scores for each sentence

**Intuition:** After $k$ steps of random walking through the similarity graph, which sentences have we visited most? Those are the central, important sentences.

## Step 7: Map Scores to Highlight Colors

Now we have a vector of raw salience scores from the random walk. Problem: these scores have no physical meaning. Different embedding models produce wildly different ranges:

- Model A on Doc 1: `[0.461, 1.231]`
- Model B on Doc 2: `[0.892, 1.059]`

We need to turn this vector of arbitrary numbers into CSS highlight opacities in `[0, 1]`.

Here's the reasoning behind the remapping function. I could do trivial linear scaling: multiply by a constant to bring the scores into a fixed range. But let's try to make the top sentences stand out more. One trick: exponentiation. Since human perception of brightness is not linear, exponentiation will preserve order but push the top values apart more. It makes the top few sentences really pop out.

**Building the remapping function**

Given a salience vector $\mathbf{s}$ with values ranging from $\min(\mathbf{s})$ to $\max(\mathbf{s})$:

1. **Find an exponent** $p$ such that the exponentiated scores reach the target spread of 2. Finding the right exponent takes more work than linear scaling, but that's still easy with a simple solver.
2. **Find a threshold** $\tau$ such that 50% of the sentences get clamped to zero. Since I'm using this for editing documents, I only want to see highlights on roughly half the sentences: the important half.

The final opacity mapping is:

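As a rough illustration only (not necessarily the mapping used in Salience), the thresholding and clamping could look something like the sketch below. The function name `to_opacities`, the use of the median as the threshold, and the linear rescaling above it are all assumptions. The exponent `p` is simply passed in, because solving for it depends on how the spread target is defined.

```python
import statistics

def to_opacities(scores, p):
    """Map positive scores to opacities in [0, 1] (illustrative assumptions only)."""
    powered = [s ** p for s in scores]
    tau = statistics.median(powered)  # assumed threshold: about half fall at or below it
    top = max(powered)
    if top == tau:                    # degenerate case: nothing rises above the threshold
        return [0.0 for _ in powered]
    # Subtract the threshold, rescale so the top score maps to 1, then clamp to [0, 1].
    return [min(1.0, max(0.0, (x - tau) / (top - tau))) for x in powered]

raw_scores = [0.5, 0.7, 0.9, 1.1, 1.2, 0.6]  # made-up values for illustration
opacities = to_opacities(raw_scores, p=3.0)
```

With these made-up scores, the three lowest sentences end up at opacity 0 and the highest-scoring sentence at opacity 1.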
For each document, I use a simple 1D solver to find the exponent $p$ and threshold $\tau$ that satisfy these constraints.

**Final thought:** This last step, converting the output from TextRank into highlight colors, is the weakest part of the system. I have no idea if it's actually correct or whether it even allows meaningful comparison between different embedding models. It works well enough for the intended purpose (quickly seeing which sentences to keep when editing), but the numerical values themselves are essentially arbitrary.

---

[← Back to App](/)