2024-03-24 The Necessary Semantics behind Emphasis and Strong

(Sometimes known as Italics and Bold)


This is part of my series on re-assessing the design of Gemini and Gopher:

Gopher's Uncontextualized Directories vs. Gemini's Contextualized Directories

What Gemini Gets Wrong With Anti-Extensibility


There are two pieces of information that often go underappreciated within geminispace: emphasis and strong. Both provide different levels of emphasis. Oftentimes we associate these with styling or presentation, but this is an incorrect way to think about them. HTML deliberately favors emphasis (`<em>`) and strong (`<strong>`) over presentational bold and italics for this very reason.


The Presentation of Emphasis and Strong


The *presentation* of emphasis and strong is italics and bold. These happen to be *visual* presentations. Italics have a lighter emphasis than bold, so italics is written slanted, whereas bold uses a heavier font weight. The heavier font weight, for visual readers, captures your attention even more than the slant. Definitions and terms often use bold, whereas verbal inflections use italics.


However, this is not the only way to present emphasis and strong. In an audible presentation, emphasis has a lighter inflection shift than strong does. Strong might stress every syllable, whereas emphasis often either strengthens the already-stressed syllable(s) of the emphasized words, following the stress rules of the respective language, or raises the stress of every syllable *while keeping the stress of each syllable distinct.* **The presentation conveys the semantic detail**.


The Semantics of Emphasis and Strong


We use emphasis and strong in our daily life. When we're sarcastic, we *obviously* use emphasis. When we are assertive, **of course** we use strong! Sarcasm and assertiveness are not presentations; they are semantic details. They change the meaning of a sentence. What do they change? They change *the meaning* of a sentence. What do they do? They *change* the meaning of a sentence. What do they change the meaning of? They change the meaning *of a sentence*.


What am I responding with in each sentence; what am I emphasizing? I am emphasizing the parts that are presented with the emphasis markers! Not only am I emphasizing them visually, but I am emphasizing them in response to the questions. I am emphasizing them as semantically important. They direct listeners and readers to the part of the sentence where the answer to the question lies. In speech we change our vocal inflections; in writing we add visual markers; in sign language one uses facial expressions, holds a sign for longer, or introduces the use of the pinky finger.


In fact, this emphasis is a semantic detail that already exists in some languages, called "fronting". Fronting happens when you move what part of the sentence you want to emphasize to the front. The front you emphasize by placing it first. All the time, Yoda does this. Fronting is what he's using to achieve OSV syntax. He's using fronting. Using fronting is what he does. Semantic detail becomes attached to syntax. When attached to syntax, syntax becomes our way of conveying meaning. What is emphasized is the meaning that we are conveying. Syntax and vocal inflections are our presentation, but the emphasis itself is the meaning. First, English places emphasized things, but other languages might place them last.


Emphasis and Strong **have semantic meaning.** They are **important**, and they are often **necessary**. They can convey inflections that fronting cannot convey, or they can provide a form of emphasis that doesn't require completely changing the syntax of your sentence.


The Use of Emphasis and Strong


Regardless of the Gemini spec's lack of emphasis and strong, people still use them within geminispace. I frequently use them myself. A natural consensus has formed online around marking them with asterisks or underscores. This probably started with Markdown and then spread across the internet; they are now used in Discord, Slack, and other chat software, in WYSIWYG editors, and in some word-processor-like software. The consensus formed because they are useful and sometimes necessary.


Gemini's spec disregards them only so that clients aren't *required* to parse every word of a paragraph and split it into separate strings, each stored with a type that says whether it is emphasized, strengthened, or normal text. Clients would then have to handle the rendering of these groups of split strings. Clients can still choose to do this if they so wish even though it's not in the spec. It's actually harder to deal with this in GUIs than it is in terminals, especially if you are just printing text out as it comes in from the connection, which is very doable for terminal clients.


A Consistent Syntax for Emphasis and Strong in Markdown


One criticism that I have seen is that emphasis and strong don't have a consistent syntax. This appears to be true *only* if one looks at Markdown specs, but it is *not* true when one looks at how emphasis and strong are actually used in most languages.


Every use of emphasis and strong above has occurred on either whitespace or punctuation boundaries. And this is in fact how some Markdown specs would describe the syntax. While I haven't done any studies, I would bet that 99% of the usage of emphasis and strong adheres to this.


So what about the 1%? This is the percentage of text that tries to strengthen a syllable within a word:

This would be an ex**am**ple. And here's a **bet**ter one.

What's the rule here? For "better", one side seems to be on a space boundary, while the other side is on a character boundary (character excluding whitespace). However, for "example", both sides are on character boundaries. This happens to be less common, but there could be a solution here.


Notice that for each toggle, one side must always be next to a character. In "better", the first toggle is next to the character `b` on its right. The second toggle is next to the character `t` on its left, and the other character `t` on its right.


In "example", the first toggle is next to two characters, just like the second toggle. If we use this on a whole word, you will find that both toggles are next to a character: "usage on a **whole** word". And if you use it on a word that's next to punctuation, you will find the same thing again: "usage on a **word,** where the second toggle is next to punctuation".


So, if we think of these markers as toggles, we will see that a toggle is either `*` or `**` where at least one side is next to a character that is not whitespace. This does exclude usages like the following: "The ** floating strong toggles ** would not parse". I do believe that this is a good thing, however, because it prevents us from mistaking other usages of asterisks for strong toggles.


Note that if we allow emphasis and strong toggles to be surrounded by non-whitespace characters on both sides, then this will break common use cases of underscores and asterisks, like censoring words ("f*ck") and snake_case identifiers. One can solve this by requiring all toggles, or perhaps just some toggles, to have whitespace or punctuation on one side, but this does prevent bolding syllables in words (unless one chooses to use a hyphen, **some**-thing like this).
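
As a rough sketch of that stricter rule (not taken from any spec), a toggle could be accepted only when one neighbor is whitespace or punctuation and at least one neighbor is a non-whitespace character other than the marker. The name `isStrictToggle` is hypothetical, and treating "punctuation" as Go's `unicode.IsPunct` is an assumption made for illustration:

```go
package main

import (
	"fmt"
	"unicode"
)

// isStrictToggle is a hypothetical helper. prev and next are the runes on
// either side of the whole marker; pass '\n' for the start or end of the text.
func isStrictToggle(marker, prev, next rune) bool {
	// boundary: whitespace or punctuation (assumed to mean unicode.IsPunct).
	boundary := func(r rune) bool { return unicode.IsSpace(r) || unicode.IsPunct(r) }
	// solid: a character that is neither whitespace nor the marker itself.
	solid := func(r rune) bool { return !unicode.IsSpace(r) && r != marker }
	return (boundary(prev) || boundary(next)) && (solid(prev) || solid(next))
}

func main() {
	fmt.Println(isStrictToggle('*', 'f', 'c')) // asterisk inside a censored word: false
	fmt.Println(isStrictToggle('_', 'e', 'n')) // underscore inside an identifier: false
	fmt.Println(isStrictToggle('*', ' ', 'w')) // opening toggle of " **whole**": true
	fmt.Println(isStrictToggle('*', 'e', '-')) // closing toggle of "**some**-thing": true
	fmt.Println(isStrictToggle('*', 'x', 'a')) // syllable toggle in "ex**am**ple": false
}
```

The last case shows the cost mentioned above: under this variant a syllable inside a word can no longer be strengthened without a hyphen.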


The final thing to consider is whether strong toggles can be next to each other, like in the following: "This is a **** example." To fix this we simply modify our rule:

A toggle is either `*` or `**` where at least one side is next to a character that is not a whitespace or an asterisk.


Nesting Strong and Emphasis


So, now we have a big problem. We can't nest strong and emphasis! Because we use the same symbol for them both, the asterisk, we can't actually place them next to each other, especially with the rule that we've created above.


We fix this by changing our symbols, and this is precisely what AsciiDoc does. Emphasis uses underscores (`_`) and strong uses asterisks (`*`). Now we can nest them with _*each other.*_


We can change our rules by splitting them up now:

A strong toggle is an asterisk `*` where at least one side is next to a character that is not a whitespace or an asterisk.

An emphasis toggle is an underscore `_` where at least one side is next to a character that is not a whitespace or an underscore.

Whitespace is a tab, newline, space, EOF, or the start of a file/string/stream.
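
These rules can be sketched as a single predicate. This is only an illustration: the name `isToggle` is hypothetical, `unicode.IsSpace` is used as the whitespace test (it accepts more than tab, newline, and space), and the start and end of the text are passed in as `'\n'` so that they count as whitespace:

```go
package main

import (
	"fmt"
	"unicode"
)

// isToggle is a hypothetical helper. marker is '*' for strong or '_' for
// emphasis; prev and next are the runes on either side of the marker.
// Pass '\n' for the start of the text or for EOF.
func isToggle(marker, prev, next rune) bool {
	// solid: neither whitespace nor another copy of the marker.
	solid := func(r rune) bool { return !unicode.IsSpace(r) && r != marker }
	return solid(prev) || solid(next)
}

func main() {
	fmt.Println(isToggle('*', '\n', 'b')) // "*bold..." at the start of the text: true
	fmt.Println(isToggle('*', ' ', ' '))  // floating asterisk: false
	fmt.Println(isToggle('*', ' ', '*'))  // first asterisk of a bare " ** ": false
	fmt.Println(isToggle('_', '*', 'e'))  // underscore nested right after an asterisk: true
}
```

Because the marker is a parameter, an asterisk next to an underscore still counts as a toggle, which is what makes nesting work.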


Inconsistency with Common Usage


However, we have a new problem: this has changed completely from what common usage is. Common usage seems to be that *one* asterisk or underscore is for emphasis, and *two* asterisks or underscores are for strong. But now we've changed both toggles into using one symbol! In all existing text, all emphasis would turn to strong, and all strong text would disappear.


Unfortunately, the common usage also prevents nesting, so it appears we cannot have *both* nesting and common usage.


Another option


Our other option is to make strong toggles two asterisks and emphasis toggles one underscore. We keep the different symbols, and we use part of the common usage. However, we now ignore cases where someone uses two underscores for strong and one asterisk for emphasis.


Reassessing the Tradeoffs


So we have 3 options here:

1. We can keep the common usage and not allow nesting.

2. We can allow nesting by using different symbols for strong and emphasis, and reduce clutter by using one character for each toggle, but we sacrifice consistency with common usage and expectations.

3. We can allow nesting by using different symbols for strong and emphasis, but we sacrifice the reduction of clutter by using two characters for strong toggles, and we exclude some common usage.
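
One way to compare the three options is as small marker tables. The sketch below is only illustrative: the `Syntax` type, the `options` table, and the `meaning` function are hypothetical names, and the lookup matches a whole marker while ignoring the boundary rules discussed earlier:

```go
package main

import "fmt"

// Syntax lists which markers toggle each kind of emphasis (hypothetical type).
type Syntax struct {
	Name     string
	Emphasis []string
	Strong   []string
}

// options mirrors the three numbered options above.
var options = []Syntax{
	{"option 1", []string{"*", "_"}, []string{"**", "__"}},
	{"option 2", []string{"_"}, []string{"*"}},
	{"option 3", []string{"_"}, []string{"**"}},
}

// meaning reports how a whole marker is read under a given syntax.
func meaning(s Syntax, marker string) string {
	for _, m := range s.Strong {
		if m == marker {
			return "strong"
		}
	}
	for _, m := range s.Emphasis {
		if m == marker {
			return "emphasis"
		}
	}
	return "literal"
}

func main() {
	for _, s := range options {
		fmt.Printf("%s: *=%s **=%s _=%s __=%s\n", s.Name,
			meaning(s, "*"), meaning(s, "**"), meaning(s, "_"), meaning(s, "__"))
	}
}
```

Reading across the rows shows the tradeoff: only option 1 accepts every common marker, and only options 2 and 3 keep strong and emphasis on different symbols.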


Option 3 seems to be the best balance. It's what I will be going with for the Scroll Protocol, although parsing them will be optional for clients:

Scroll Protocol


Bullets and Bold/Emphasis


Another consideration is distinguishing between emphasis/bold and the start of a bullet. Based on our rules above, the toggles must have one side next to a character that is not an asterisk/underscore or whitespace. If our line type for bullets is already on a whitespace boundary (the newline just before it), then we just need to put the other side on a whitespace boundary! This is what both the Gemini spec and the Scroll spec do:

* This is a bullet, because there's a whitespace after the asterisk
  and the asterisk occurs at the beginning of a line.
**This is bold** because there is no whitespace after the two asterisks,
and because the first toggle has one side that is next to a character
that is not whitespace or an asterisk.
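
A minimal sketch of that distinction, assuming a bullet is any line that begins with an asterisk followed by a space (the function name `isBulletLine` is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// isBulletLine reports whether a line starts a bullet: an asterisk at the
// very beginning of the line with a space right after it.
func isBulletLine(line string) bool {
	return strings.HasPrefix(line, "* ")
}

func main() {
	fmt.Println(isBulletLine("* This is a bullet"))           // true
	fmt.Println(isBulletLine("**This is bold** because ...")) // false
	fmt.Println(isBulletLine("2 * 3 is not a bullet"))        // false
}
```

Since the bullet asterisk has whitespace on both sides (the line start and the space), it can never satisfy the toggle rule, so the two checks do not conflict.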

Parsing


The final challenge is in parsing, which our rules have complicated. Toggles where the character or punctuation is on the left side of the toggle are fairly easy to parse, but toggles where the character is on the right are slightly harder, because we have to peek at the next character when we see a toggle. This is particularly hard if we are streaming in and printing gemtext or scrolltext character-by-character rather than line-by-line.


Below is my solution written in Go. It streams in text from a `bufio.Reader` and prints it back out using VT-100 escape sequences. One could easily modify this to print to an `io.Writer` or a `strings.Builder` instead.


Additionally, one could reimplement this using a regular `io.Reader` or `strings.Reader` and without using the `reader.UnreadRune()` function. I did not do this because it would make the code more complicated and less readable, but it is definitely possible; you would just have to store more parsing state. The important point to note is that the lack of a way to "peek" or unread characters/runes does not make parsing strong and emphasis within a character stream impossible.


Finally, one could make this parsing even easier if we are reading whole lines from a stream/connection before presenting those lines to the user, rather than individual characters, but I wanted to show that it is possible to do as individual characters/runes are received from the stream.
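
To illustrate the whole-line approach, here is a minimal sketch. It is not the code Scroll-Term uses: `renderLine` is a hypothetical function, it handles only `**` and `_` (no monospace), and it returns a string instead of printing. With the whole line in memory, looking at the neighbors of a marker is just indexing:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// renderLine replaces `**` and `_` toggles in one line with ANSI escapes.
func renderLine(line string) string {
	runes := []rune(line)
	var out strings.Builder
	inStrong, inEmphasis := false, false

	// at returns the rune at index i; outside the line it returns '\n',
	// so the start and end of the line count as whitespace.
	at := func(i int) rune {
		if i < 0 || i >= len(runes) {
			return '\n'
		}
		return runes[i]
	}
	// solid: neither whitespace nor the marker character itself.
	solid := func(r, marker rune) bool { return !unicode.IsSpace(r) && r != marker }

	for i := 0; i < len(runes); i++ {
		r := runes[i]
		width := 0 // length of the marker starting at i, 0 if there is none
		switch {
		case r == '*' && at(i+1) == '*':
			width = 2
		case r == '_':
			width = 1
		}
		if width == 0 {
			out.WriteRune(r)
			continue
		}
		if solid(at(i-1), r) || solid(at(i+width), r) {
			switch {
			case r == '*' && !inStrong:
				out.WriteString("\x1b[1m") // bold on
			case r == '*':
				out.WriteString("\x1b[22m") // bold off
			case !inEmphasis:
				out.WriteString("\x1b[3m") // italic on
			default:
				out.WriteString("\x1b[23m") // italic off
			}
			if r == '*' {
				inStrong = !inStrong
			} else {
				inEmphasis = !inEmphasis
			}
		} else {
			// Not a toggle: keep the marker as literal text.
			out.WriteString(string(runes[i : i+width]))
		}
		i += width - 1 // skip the second asterisk of a two-rune marker
	}
	if inStrong || inEmphasis {
		out.WriteString("\x1b[m") // reset anything left open at the end of the line
	}
	return out.String()
}

func main() {
	fmt.Println(renderLine("usage on a **whole** word, and _**nested**_ toggles"))
	fmt.Printf("%q\n", renderLine("The ** floating strong toggles ** would not parse"))
}
```

The tradeoff is latency: nothing is shown until the full line has arrived, which is exactly what the streaming version avoids.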


Here is a golang file using the example parser code from below:

Terminal Emphasis and Strong Parsing & Printing


I also have this variation that will not reset at the end of the reader (EOF), which allows me to call the function multiple times on word-wrapped lines. This code also has AsciiDoc and Markdown variants:

The Code my Scroll-Term Client Uses


Scroll-Term supports Gemini, Nex, and my Scroll protocol. You can download precompiled binaries of Scroll-Term here:

Precompiled Binaries


Example Parser


2024 - Christian Lee Seibold

License: MIT


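// The function below needs these imports.
import (
	"bufio"
	"fmt"
	"unicode"
)

// printWithEmphasisAndStrong streams runes from reader and prints them with
// VT-100 escape sequences for the following toggles:
// Emphasis - one underscore
// Strong - two asterisks
// Monospace - one backtick (`)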
func printWithEmphasisAndStrong(reader *bufio.Reader) {
	previousRune := '\n'
	inStrong := false
	inEmphasis := false
	inMonospace := false

	for {
		r, _, err := reader.ReadRune()
		if err != nil {
			// EOF - reset everything, since this should be the end of the paragraph
			fmt.Printf("\x1B[m")
			return
		}

		if r == '`' || (!inMonospace && (r == '*' || r == '_')) {
			toggleRune := r
			toggle := string(r)
			unread := true

			// Get the next rune
			r, _, err = reader.ReadRune()
			if err != nil {
				// Set rune to new line, so it registers as a whitespace.
				r = '\n'
				unread = false
			}

			// If r is an asterisk, we require two for the toggle, so read the next rune. Otherwise, if just one asterisk,
			// print the runes and continue
			if toggleRune == '*' && r == '*' {
				toggle = "**"
				r, _, err = reader.ReadRune()
				if err != nil {
					// Set rune to new line, so it registers as a whitespace.
					r = '\n'
					unread = false
				}
			} else if toggleRune == '*' {
				fmt.Printf("*")
				_ = reader.UnreadRune()
				previousRune = '*'
				continue
			}

			if (!unicode.IsSpace(r) && r != toggleRune) || (!unicode.IsSpace(previousRune) && previousRune != toggleRune) {
				switch toggleRune {
				case '`':
					if inMonospace {
						// Reset
						inMonospace = false
						fmt.Printf("\x1B[22m")
						// Since 22m disables bold *and* dim, set bold again if necessary
						if inStrong {
							fmt.Printf("\x1B[1m")
						}
					} else {
						// Set
						inMonospace = true
						fmt.Printf("\x1B[2m")
					}
				case '*':
					if inStrong {
						// Reset
						inStrong = false
						// 22m disables bold *and* dim. However, we cannot use strong inside monospace toggles, so this doesn't matter.
						fmt.Printf("\x1B[22m")
					} else {
						// Set
						inStrong = true
						fmt.Printf("\x1B[1m")
					}
				case '_':
					if inEmphasis {
						// Reset
						inEmphasis = false
						fmt.Printf("\x1B[23m")
					} else {
						// Set
						inEmphasis = true
						fmt.Printf("\x1B[3m") // Replace this with 4m for underline in terminals that don't support emphasis.
					}
				}
				_ = reader.UnreadRune()
				if unread {
					previousRune = toggleRune
				}
			} else {
				// Not a toggle, print the toggle and unread the rune
				fmt.Printf("%s", toggle)
				_ = reader.UnreadRune()
				previousRune = toggleRune
			}
		} else {
			fmt.Printf("%c", r)
			previousRune = r
		}
	}
}

Continue the Series


Here are the next articles in this series:

2024-03-25 The Simplicity of List Nesting: How AsciiDoc Does It

2024-03-26 The Case for a 4th-Level Heading

2024-03-27 Who Controls Presentation? Presentation vs. Semantics

2024-03-28 Headers, Footers, Sidebars, and Footnotes

