How TeX macros actually work: Part 2


Introduction: A story in pictures

As noted in Part 1, TeX has to “read” every character within your .tex file and that process of reading is more correctly referred to as scanning. Traditionally, TeX’s input-processing (scanning) is likened to TeX having “eyes” with which to observe the input, so we’ll adopt that time-tested analogy within the graphics below.

Graphic 1: The eyes are ready

We assume that TeX has obtained some input from a .tex file and is about to process our string of characters Hello World \jobname contained within a paragraph of text. It will check each character in turn and examine its category code.

TeX's eyes ready to scan a line of text

Graphic 2: Processing category codes

In the next graphic we see, in outline (with further details below), how TeX reacts to several different category codes. Note that there are 16 category codes in total but, for simplicity, we depict just three: 11, 10 and 0. Other category codes become important during TeX’s typesetting processes such as constructing tables, typesetting mathematics and recognizing macro parameters.

TeX reacting to several different category codes

Notes for Graphic 2

Here, we are considering TeX as it is reading (scanning) characters that form part of a paragraph of text. TeX inspects every character, checks its category code and takes the appropriate action based on the category code and TeX’s “mode” (a status based on what it is currently doing).
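The "check its category code" step can be pictured as a simple lookup. The sketch below is only an illustration in Python, not how any TeX engine is written: the function name `catcode` is hypothetical, and the table is a fixed assumption covering just the three codes used in the graphics (everything else is lumped into 12, "other"). In real TeX the assignments can be changed at any time.

```python
# Simplified category-code lookup. Only the three codes shown in the
# graphics are modelled; every other character is treated as 12 ("other").
# The fixed table is an assumption: real TeX allows it to be changed.
ESCAPE, SPACER, LETTER, OTHER = 0, 10, 11, 12


def catcode(ch: str) -> int:
    """Return the assumed category code of a single character."""
    if ch == "\\":
        return ESCAPE
    if ch == " ":
        return SPACER
    if ch.isascii() and ch.isalpha():
        return LETTER
    return OTHER


line = r"Hello World \jobname"
for ch in line:
    # Show each character alongside the category code TeX would inspect.
    print(repr(ch), catcode(ch))
```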

(green eyes) TeX will see that each of those characters has category code 11 (“letter”) and will forward those characters for typesetting as part of the paragraph it is building. However, TeX does not forward (use) the character code alone; instead, it uses the pair of numbers (character code, category code) to calculate a composite integer value called a character token (see below). Once that character token is produced it enters into TeX’s inner typesetting processes/algorithms.
(blue eyes) TeX sees a space character (ASCII 32) with category code 10 (“spacers”). Note that, as discussed, the category code of a space (ASCII 32), or of any character, may have been changed to another value before TeX reads it in.

How TeX actually processes characters with category code 10 (“spacers”) does vary depending on when/where TeX sees it—TeX’s current “mode”. For example, there are times when TeX will simply skip them. Here, TeX will know it has detected a character with category code 10 (which happens to be a space, ASCII 32) whilst processing paragraph text so it will eventually convert it to a so-called interword glue: a sort of flexible space that can stretch or shrink.

(red eyes) Here, TeX has observed a character which has a very important category code: 0 (escape character).

An escape character—any character with category code 0—tells TeX to switch into a special reading mode and carefully scan (read) the subsequent characters because they identify the name of a command, not text to be typeset. In TeX literature you will see the term “command” also being referred to as control sequence. After seeing an escape character, TeX checks the category code of the character that follows immediately after it; this is because TeX recognizes two types of command:

multi-letter commands called control words: the character following immediately after the escape character has category code 11. All subsequent characters that have category code 11 are considered to be part of the name of a command. TeX will stop looking for characters that form part of a command name when it detects any character that does not have category code 11—such as a space character with category code 10.
single-character commands called control symbols: the character following immediately after the escape character does not have category code 11.

You can think of an escape character as triggering TeX to “escape” out of its usual scanning behaviour and adopt a different approach for the next few characters—this is indicated by the red dotted box showing that TeX will Start scanning for a command.
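The rule for telling control words from control symbols can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not TeX's actual code: `catcode` and `scan_command` are hypothetical names, the category codes are a fixed table (backslash is 0, space is 10, ASCII letters are 11, anything else is 12), and the sketch assumes at least one character follows the escape character.

```python
ESCAPE, SPACER, LETTER, OTHER = 0, 10, 11, 12


def catcode(ch: str) -> int:
    """Assumed fixed category codes (real TeX lets these be changed)."""
    if ch == "\\":
        return ESCAPE
    if ch == " ":
        return SPACER
    if ch.isascii() and ch.isalpha():
        return LETTER
    return OTHER


def scan_command(text: str, pos: int):
    """Scan a command whose escape character is at text[pos].

    Returns (name, kind, next_pos), where next_pos is the index of the
    first character that is not part of the command name.
    """
    if catcode(text[pos]) != ESCAPE:
        raise ValueError("text[pos] is not an escape character")
    start = pos + 1
    if start >= len(text):
        raise ValueError("this sketch needs a character after the escape")
    if catcode(text[start]) != LETTER:
        # Control symbol: exactly one non-letter character.
        return text[start], "control symbol", start + 1
    end = start
    # Control word: keep going while characters have category code 11.
    while end < len(text) and catcode(text[end]) == LETTER:
        end += 1
    return text[start:end], "control word", end


print(scan_command(r"\jobname rest", 0))
print(scan_command(r"\% x", 0))
```

In the first call the space after `\jobname` ends the name; in the second, `%` is not a letter, so the name is that single character.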

Graphic 3: Processing category code 11 (“letters”)

In Part 1 of this series we noted that each character TeX reads from its input is described by two integers:

character code: an integer defining the numeric representation of a character;
category code: a value from 0 to 15 that TeX assigns to every character which might appear in its input.

TeX uses these two pieces of information in the next stage of its processing: creating character tokens.

Graphic 3 expands on Graphic 2 to show what TeX does with these input characters having category code 11 (letter): it creates character tokens—integer values that TeX calculates using a combination of that character’s category code and character code.

Note: In this example we are only discussing characters with category code 11, but you should be aware that TeX also creates token values for input characters that have other category codes—except category code 0 which is never turned into a token: the escape character simply acts as a “switch” to trigger special processing.

TeX processing characters with category code 11

Graphic 5, below, will show what TeX does when it sees a character with category code 0 (an escape character).

Notes for Graphic 3: Processing category code 11 (“letter”)

Here, we’ll focus on the green activity: what happens when TeX sees characters which have category code 11 (“letter”). After TeX has read a character and determined its category code (here, 11), it combines this pair of numbers into a single integer called a character token: these tokens (integers) are forwarded to the next stage of TeX’s inner typesetting algorithms/processing. As noted, TeX will also create character tokens for characters with other category codes (i.e., not 11); here we are just using category code 11 as an example.

Each character token (an integer) permanently binds an input character with the category code assigned to that character at the time it was scanned (read-in) by TeX: that fact is of crucial importance in understanding the behaviour of TeX/LaTeX macros. Of course, during further processing, TeX will sometimes need to split-up a character token to determine which (character code, category code) pair was used to construct that token. However, once a character is read-in by TeX’s input (scanning) process, the character token value calculated by TeX results in that character being permanently coupled to the category code assigned to it at the time it was read-in.

Calculating character tokens

TeX engines use a simple formula to calculate a character token, \(T\), from a character with category code \(C\) and character code \(A\):

\[T = \text{constant} \times C + A\]

8-bit engines, such as pdfTeX, use:

\[T = 256\times C + A\]
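As a quick worked example of the 8-bit formula, the Python sketch below computes the character tokens for the letters of Hello, each with category code 11. The function name `char_token_8bit` is hypothetical and the range checks are an addition for illustration; only the arithmetic comes from the formula above.

```python
def char_token_8bit(char_code: int, cat_code: int) -> int:
    """Apply T = 256 * C + A for an 8-bit engine."""
    if not 0 <= char_code <= 255:
        raise ValueError("8-bit character codes run from 0 to 255")
    if not 0 <= cat_code <= 15:
        raise ValueError("category codes run from 0 to 15")
    return 256 * cat_code + char_code


# 'H' has character code 72, so its token is 256 * 11 + 72 = 2888.
tokens = [char_token_8bit(ord(ch), 11) for ch in "Hello"]
print(tokens)  # [2888, 2917, 2924, 2924, 2927]
```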

Unicode-aware engines, such as XeTeX or LuaTeX, have to use a different formula because, under Unicode, character codes can be much larger than the maximum of 255 in the older 8-bit ASCII encoding world. XeTeX, for example, uses:

\[T= 2^{21}\times C + A \hskip5mm \text{(where } A \text{ is a Unicode character code value)}\]
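Because the character code is always smaller than the constant, the formula can be reversed with an integer division and remainder, which is the "splitting-up" of a character token mentioned earlier. The sketch below illustrates this in Python using the XeTeX constant quoted above; `encode`, `decode` and `XETEX_SCALE` are hypothetical names chosen for this example, not identifiers from any TeX engine.

```python
XETEX_SCALE = 2 ** 21  # the constant from the XeTeX formula above


def encode(char_code: int, cat_code: int, scale: int = XETEX_SCALE) -> int:
    """T = scale * C + A."""
    if not 0 <= char_code < scale:
        raise ValueError("character code must be smaller than the scale")
    return scale * cat_code + char_code


def decode(token: int, scale: int = XETEX_SCALE):
    """Recover (character code, category code) from a character token."""
    cat_code, char_code = divmod(token, scale)
    return char_code, cat_code


token = encode(ord("é"), 11)       # U+00E9 is 233
print(token)                       # 11 * 2097152 + 233
print(decode(token))               # back to (233, 11)
```

The same pair of functions reproduces the 8-bit case when called with `scale=256`.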

Once again, it is worth noting that characters with category code 0 are not converted into character tokens: the category code 0 has a very special place in TeX’s input filtering and is used purely as a “switch” to trigger TeX into a special mode of scanning the next few characters. Graphic 5 deals with this.

Graphic 4: Processing category code 10 (“spacer”)

TeX’s handling of characters with category code 10 (“spacer”) depends on what TeX is currently working on when it detects a character of category code 10 within the input. In our example, TeX is performing routine paragraph processing and the space character, with category code 10, will be converted to interword glue.

TeX processing characters with category code 10

TeX’s handling of spaces can appear to be quite idiosyncratic but a good overview can be found in chapters 1 and 2 of TeX by Topic by Victor Eijkhout—you can download a free PDF copy via his website.

For example, when TeX sees a character of category code 10, there are times when TeX will:

skip (ignore) all of them, e.g., when TeX is in vertical mode;
convert multiple spacers to a single spacer, skipping the extra spaces, such as when processing a paragraph;
absorb them, e.g., absorbing a single space after a command name.

Note too that there are times when TeX will also generate spaces by converting end-of-line characters to a space. The behaviour/treatment of space characters (any character with category code 10) is one of TeX’s “idiosyncrasies”: it takes time/practice to become familiar (comfortable) with this aspect of TeX.
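Some of this behaviour can be modelled with a small state machine. The Python sketch below is a simplified illustration, not TeX's implementation: it borrows the three scanner states usually described for TeX (N for the start of a line, M for mid-line, S for skipping blanks), assumes a fixed category-code table, and ignores several special cases such as the control space. The function names and the `<space>` marker are hypothetical.

```python
ESCAPE, SPACER, LETTER, OTHER = 0, 10, 11, 12


def catcode(ch: str) -> int:
    """Assumed fixed category codes (real TeX lets these be changed)."""
    if ch == "\\":
        return ESCAPE
    if ch == " ":
        return SPACER
    if ch.isascii() and ch.isalpha():
        return LETTER
    return OTHER


def scan_lines(lines):
    """Return a flat list of items the scanner would pass on."""
    out = []
    for line in lines:
        state = "N"                      # every line starts in state N
        i = 0
        while i < len(line):
            cat = catcode(line[i])
            if cat == SPACER:
                if state == "M":         # first space after text: keep one
                    out.append("<space>")
                    state = "S"
                i += 1                   # in N or S the space is skipped
            elif cat == ESCAPE:
                j = i + 1
                while j < len(line) and catcode(line[j]) == LETTER:
                    j += 1
                if j == i + 1:           # control symbol: one character
                    j = min(i + 2, len(line))
                    state = "M"
                else:                    # control word: skip spaces after it
                    state = "S"
                out.append(line[i:j])
                i = j
            else:
                out.append(line[i])
                state = "M"
                i += 1
        # End of line: an empty line gives \par, text gives a space.
        if state == "N":
            out.append("\\par")
        elif state == "M":
            out.append("<space>")
    return out


print(scan_lines([r"Hello   World \jobname   x", "", "b"]))
```

Running it shows the three spaces after Hello collapsing to one, the spaces after `\jobname` disappearing, each end of line after text becoming a space, and the empty line becoming `\par`.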

Graphic 5a: Processing category code 0 (an “escape character”)

In this graphic, TeX has processed all characters up to the \ character, which has a category code of 0: the “escape character”. We’ll use a further sequence of graphics to show how TeX processes an escape character and identifies the name of a command.

TeX processing characters with category code 0

Graphic 5b: Looking for a command name

In this graphic we look into the red-dotted box section (Start scanning for a command) to see what TeX does after it has seen an escape character.

TeX looking for a command name

Notes to Graphic 5b

Once recognized, the escape character has done its job: it acted as a switch and does not take part in any further processing—specifically, it is not converted to a character token.
For convenience, we’ll repeat some detail mentioned earlier. After seeing an escape character, TeX checks the category code of the character that follows immediately after it; this is because TeX recognizes two types of command:
- multi-letter commands called control words: the character following immediately after the escape character has category code 11. All subsequent characters that have category code 11 are considered to form the name of a command (control word). TeX will stop looking for characters that form part of a command name when it detects any character that does not have category code 11—such as a space character with category code 10.
- single-character commands called control symbols: the character following immediately after the escape character does not have category code 11.
In our example, the first character after the \ is a j (category code 11) which tells TeX to look for a command that is (potentially) a multi-letter sequence of characters with category code 11.
TeX keeps checking for more characters that have category code 11. As soon as it detects a character with any other category code, such as a space with category code 10, TeX knows that it has reached the end of the command name. Just to emphasize the point: here it was a space character (category code 10) which terminated the command name, but it could have been any character that did not have category code 11.

Part 3

In Part 3 we continue on from Graphic 5b to complete this part of the story (how TeX identifies a command) and move on to what it does next. We also take a deeper look into some internal aspects of TeX’s processing; parts of this can be skipped on first reading, unless you really enjoy the details.