Today I Learned: Helpful Error Messages are Easy
I was working on a personal command-line tool that did some Markdown parsing. I was curious how easy it would be to upgrade from low-effort stderr print statements to the very helpful error message style I enjoyed from Elm.
Turns out it was easy, so let me tell you about it.
The anatomy of a good error message
I checked how Rust did things and peeked at TypeScript. My needs (and probably yours) are much simpler than a compiler, so I will show you the simplified version that blends ideas from all 3 languages.
First, the terminology. Rust and TypeScript call this a Diagnostic. elm-compiler calls it a Problem. A Diagnostic represents one reportable issue in your code.
Fun fact: in Elm the module name for the error formatting subsystem is Reporting.
type Diagnostic struct {
    Filename    string
    Code        string
    Level       DiagnosticLevel
    Title       string
    Description string
    Span        Span
    Advice      string
}
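The struct leaves Span and DiagnosticLevel undefined, so here is a minimal sketch of what they could look like. Everything in it is my assumption rather than something the struct dictates: the level names, a Span stored as a half-open byte range, and the LineCol helper that turns a byte offset into the line and column shown to the user.

```go
package main

import "fmt"

// DiagnosticLevel is an assumed enum; only the type name comes from the struct.
type DiagnosticLevel int

const (
	LevelError DiagnosticLevel = iota
	LevelWarning
	LevelNote
)

// String returns the lowercase name used in the formatted output.
func (l DiagnosticLevel) String() string {
	switch l {
	case LevelError:
		return "error"
	case LevelWarning:
		return "warning"
	default:
		return "note"
	}
}

// Span is assumed to be a half-open byte range [Start, End) into the source.
type Span struct {
	Start, End int
}

// LineCol converts a byte offset into a 1-based line and column.
// Columns count bytes, not runes, which is enough for ASCII input.
func LineCol(src []byte, offset int) (line, col int) {
	line, col = 1, 1
	for i := 0; i < offset && i < len(src); i++ {
		if src[i] == '\n' {
			line++
			col = 1
		} else {
			col++
		}
	}
	return line, col
}

func main() {
	src := []byte("# Title\n\nThis is a **markdown* document with some content.\n")
	span := Span{Start: 19, End: 30} // covers "**markdown*"
	line, col := LineCol(src, span.Start)
	fmt.Printf("%s at %d:%d\n", LevelError, line, col) // prints "error at 3:11"
}
```

Storing byte offsets and computing line and column only when an error is printed keeps the Span itself down to two integers.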
Here is an example of how a Diagnostic gets formatted:
example.md:3:11 - error MD001: Invalid markdown syntax
I was looking for two asterisks to close the bold syntax, but only found one.
3 | This is a **markdown* document with some content.
              ^^^^^^^^^^^
To fix this, add a second asterisk (*) to properly close the bold formatting:
**markdown** instead of **markdown*
Bold text in Markdown requires matching pairs of double asterisks (**) or double
underscores (__) around the text you want to emphasize. Make sure each opening
pair has a corresponding closing pair.
First line: Location, severity, code, and title. This line packs in filename, line number, column number, log level (error, warning, etc.), an error code for easy googling, and a human-readable error name. All on one line.
example.md:3:11 - error MD001: Invalid markdown syntax
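Producing that first line is a single format string. A small sketch, assuming the line and column have already been computed; the Header type and FormatHeader function are names I made up for illustration.

```go
package main

import "fmt"

// Header holds only the fields the first line needs.
// Line and Column are assumed to be precomputed, 1-based values.
type Header struct {
	Filename string
	Line     int
	Column   int
	Level    string // "error", "warning", ...
	Code     string
	Title    string
}

// FormatHeader renders "file:line:col - level CODE: title".
func FormatHeader(h Header) string {
	return fmt.Sprintf("%s:%d:%d - %s %s: %s",
		h.Filename, h.Line, h.Column, h.Level, h.Code, h.Title)
}

func main() {
	fmt.Println(FormatHeader(Header{
		Filename: "example.md",
		Line:     3,
		Column:   11,
		Level:    "error",
		Code:     "MD001",
		Title:    "Invalid markdown syntax",
	}))
	// prints "example.md:3:11 - error MD001: Invalid markdown syntax"
}
```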
Description block: The compiler's perspective on the problem. What concretely was the compiler trying to do? Was it parsing the syntax and expected to find a specific token? Was it checking types and noticed a mismatch?
The Elm style is to make the compiler talk in first person: "I expected to find X but found Y" or "I can't find this type."
I was looking for two asterisks to close the bold syntax, but only found one.
Code span: Show the actual problematic code with highlighting.
3 | This is a **markdown* document with some content.
              ^^^^^^^^^^^
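The underline is mostly padding arithmetic: the carets start after the width of the line-number gutter plus the column offset. Here is a rough sketch under simplifying assumptions (a single-line span, byte-wide characters, no tabs); RenderSnippet is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// RenderSnippet prints one source line behind a "N | " gutter and underlines
// width bytes starting at the 1-based column col.
// Tabs and multi-byte characters are not handled in this sketch.
func RenderSnippet(lineText string, line, col, width int) string {
	gutter := fmt.Sprintf("%d | ", line)
	// Pad past the gutter, then past the columns before the span.
	pad := strings.Repeat(" ", len(gutter)+col-1)
	return gutter + lineText + "\n" + pad + strings.Repeat("^", width)
}

func main() {
	text := "This is a **markdown* document with some content."
	fmt.Println(RenderSnippet(text, 3, 11, 11))
}
```

For the example above this pads 14 spaces (4 for the gutter, 10 for the columns before the span) and then draws 11 carets under `**markdown*`.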
Advice: The text after the code block is where you give users both the quick fix and the understanding to prevent the same mistake in the future. I call it Advice. elm-compiler calls it postHint. rustc calls it message, help, note.
To fix this, add a second asterisk (*) to properly close the bold formatting:
**markdown** instead of **markdown*
Bold text in Markdown requires matching pairs of double asterisks (**) or double
underscores (__) around the text you want to emphasize. Make sure each opening
pair has a corresponding closing pair.
That is all we probably need for simpler programs. If you do happen to be building a compiler, you will probably need more.
Part 2: More advanced features
How does the Rust compiler draw multiple different underlines or print multiple regions in the same error message?
error: this enum takes 2 type arguments but only 1 type argument was supplied
 --> src/main.rs:1:11
  |
1 | fn f() -> Result<()> {
  |           ^^^^^^ -- supplied 1 type argument
  |           |
  |           expected 2 type arguments
  |
note: enum defined here, with 2 type parameters: `T`, `E`
   --> /lib/rustlib/src/rust/library/core/src/result.rs:241:10
    |
241 | pub enum Result<T, E> {
    |          ^^^^^^ -  -
The trick is you can also add "labels" to other spans of code. Each label gets a message string. In Go the API would look like this:
func (d *Diagnostic) Label(span Span, message string)
You can get almost all of the extra formatting you need with only this simple API addition.
type Diagnostic struct {
    // ... fields stay the same
    Labels []struct {
        Span    Span
        Message string
    }
}
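The Label method itself can be a one-line append. A minimal sketch, assuming a named LabeledSpan type in place of the anonymous struct (my choice, to keep the append readable) and a trimmed-down Diagnostic; the byte offsets in main are made up to match the Rust example.

```go
package main

import "fmt"

// Span is assumed to be a half-open byte range [Start, End).
type Span struct {
	Start, End int
}

// LabeledSpan pairs a secondary span with its message.
// A named type is used here instead of the anonymous struct for readability.
type LabeledSpan struct {
	Span    Span
	Message string
}

// Diagnostic is trimmed to the fields this sketch needs.
type Diagnostic struct {
	Title  string
	Span   Span
	Labels []LabeledSpan
}

// Label attaches a message to an additional span of code.
func (d *Diagnostic) Label(span Span, message string) {
	d.Labels = append(d.Labels, LabeledSpan{Span: span, Message: message})
}

func main() {
	d := &Diagnostic{
		Title: "this enum takes 2 type arguments but only 1 type argument was supplied",
		Span:  Span{Start: 10, End: 16}, // "Result" in "fn f() -> Result<()> {"
	}
	d.Label(Span{Start: 17, End: 19}, "supplied 1 type argument") // "()"
	fmt.Println(len(d.Labels), d.Labels[0].Message)
}
```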
If the label is contained by the primary span then we include an underline and message on a line inserted in between the formatted lines of the code output.
Labels that are not contained within your primary span are appended as new top level entries after printing the first span. All the formatting code for filepaths and spans gets reused.
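Taking the containment rule above literally, the renderer first has to sort labels into two buckets. Here is a sketch of that step; the File field on Span, the Contains method, and the Partition function are all names I am assuming for illustration, and the offsets in main are invented.

```go
package main

import "fmt"

// Span carries a File field here; that is an assumption for this sketch.
type Span struct {
	File       string
	Start, End int // half-open byte range [Start, End)
}

// Contains reports whether inner lies entirely inside s, in the same file.
func (s Span) Contains(inner Span) bool {
	return s.File == inner.File && s.Start <= inner.Start && inner.End <= s.End
}

// Label pairs a span with its message.
type Label struct {
	Span    Span
	Message string
}

// Partition splits labels into those drawn inside the primary snippet and
// those printed afterwards as their own top-level entries.
func Partition(primary Span, labels []Label) (inline, detached []Label) {
	for _, l := range labels {
		if primary.Contains(l.Span) {
			inline = append(inline, l)
		} else {
			detached = append(detached, l)
		}
	}
	return inline, detached
}

func main() {
	primary := Span{File: "src/main.rs", Start: 10, End: 20}
	labels := []Label{
		{Span: Span{File: "src/main.rs", Start: 17, End: 19}, Message: "supplied 1 type argument"},
		{Span: Span{File: "result.rs", Start: 100, End: 106}, Message: "enum defined here"},
	}
	inline, detached := Partition(primary, labels)
	fmt.Println(len(inline), len(detached)) // prints "1 1"
}
```

A real renderer might group by source line rather than strict containment, but the two-bucket shape stays the same.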
Pedantic note: Adding multiple labels to the same span seems fine: just draw more arms from the same underline to different messages. But what about partially overlapping labels? Is that a big design flaw? In practice, I think it is safe to ignore. Compiler developers sidestep the question by rewording their messages so that labels never overlap. After all, the locations of a Diagnostic's labels are statically known, even if the text contents can vary wildly.
The only tricky part is that this extra feature means that every span needs to include the source file path. Does this sound expensive? Maybe. Previously we could store spans as slice indexes into the vector of file content bytes. Two small integers.
Do we have to start storing 50 to 100 bytes of filename for every span?
No.
Many compilers use a technique called interning. Instead of storing the filename string multiple times, you store it once in a central table and hand out integer IDs that point to it. So your span just needs 4 bytes for the file ID instead of however many bytes the filename string takes. This keeps the memory overhead reasonable.
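A filename interner is small enough to sketch in a few lines. This is one possible shape, not how any particular compiler does it; FileID, FileTable, Intern, and Name are names I am assuming for illustration.

```go
package main

import "fmt"

// FileID is a 4-byte handle that stands in for a filename.
type FileID uint32

// FileTable stores each filename once and hands out IDs for it.
type FileTable struct {
	ids   map[string]FileID
	names []string
}

// NewFileTable returns an empty table ready for use.
func NewFileTable() *FileTable {
	return &FileTable{ids: make(map[string]FileID)}
}

// Intern returns the existing ID for name, or assigns the next free one.
func (t *FileTable) Intern(name string) FileID {
	if id, ok := t.ids[name]; ok {
		return id
	}
	id := FileID(len(t.names))
	t.names = append(t.names, name)
	t.ids[name] = id
	return id
}

// Name looks the filename back up when it is time to print.
func (t *FileTable) Name(id FileID) string {
	return t.names[id]
}

// Span now carries a file ID next to its two byte offsets.
type Span struct {
	File       FileID
	Start, End uint32
}

func main() {
	files := NewFileTable()
	a := files.Intern("src/main.rs")
	b := files.Intern("src/main.rs") // same string, same ID
	span := Span{File: a, Start: 10, End: 16}
	fmt.Println(a == b, files.Name(span.File)) // prints "true src/main.rs"
}
```

With three uint32 fields the Span stays a fixed 12 bytes no matter how long the path is.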
If you don't implement labels, then you don't have to worry about this. You can do what I did and take the lazy approach of defining the filename as a string on the Diagnostic type.
The End
So there you have it: a simple design for helpful error messages. And with LLMs, it is easier than ever. I write a stream-of-consciousness info dump about the error condition, how we got there, and the implications. Then boom! The LLM translates that into concise and helpful Description and Advice messages.