How TeX macros actually work: Part 5


Introduction and overview

In Part 4 we reviewed some basic properties of TeX macros in preparation for the next two articles where we take a close look at the underlying mechanics of TeX macros: specialized token lists. In these final two articles we use diagrams, called node lists, that were prepared from data generated using a specially modified version of Knuth’s original TeX software—those modifications were designed to access internal TeX data structures which are normally inaccessible to the user. By “hooking into” TeX’s internal macro-processing and execution routines it was possible to write out graphical data which enables a more detailed and accurate discussion of TeX’s macro-processing behaviour. Overleaf hopes that these diagrams assist readers to achieve a better understanding of how TeX macros really work.

Possible additional background reading

Overleaf has already published two token-related articles that provide additional background information on TeX tokens and TeX token lists. Do take time to read them if you need to fill any gaps in your understanding: they will help you get the most from Parts 5 and 6 of this series.

Macros as token lists

When TeX detects a macro-creation command (\def, \edef, \gdef or \xdef) within the input stream it triggers a process which converts both of the sections <parameter text><replacement text> of our macro’s definition into one long token list—but a very particular type of token list.

Token lists for macros differ slightly from other token lists used within TeX because they contain “special” token values that only processes internal to TeX can create: those special tokens cannot be created directly by any command that you include in your .tex file. TeX creates and uses those “special” token values to help with processing your macro call, as we’ll explore below.

A brief word on how token lists are stored: nodes

To store a list of tokens (integer values) TeX uses a data structure called a linked list, which, in TeX’s case, comprises a list of so-called nodes. You can think of a node as a small package of computer memory which can be used to store a collection of data items. To store a macro, these nodes are strung together like a chain, where each node (link in the chain) can store several pieces of information—including a token value and the memory address of the next node in the list. For further information, you can read the article What is a TeX token list but the following diagram summarizes the key features of a macro stored as a token list:

Diagram of a TeX macro token list stored as a linked-node node list
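
As a rough illustration of the chain-of-nodes idea, here is a minimal Python sketch. The Node class, its field names and the helper functions are hypothetical, chosen only for this example; TeX itself keeps its nodes in an internal memory array rather than as separate objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One link in the chain: a token value plus a pointer to the next node."""
    token: int
    next: Optional["Node"] = None  # None plays the role of TeX's "null" pointer


def build_list(tokens: List[int]) -> Optional[Node]:
    """String the token values together, first token at the head of the chain."""
    head: Optional[Node] = None
    for value in reversed(tokens):
        head = Node(value, head)
    return head


def walk(head: Optional[Node]) -> List[int]:
    """Follow the next pointers until the null pointer ends the list."""
    values = []
    node = head
    while node is not None:
        values.append(node.token)
        node = node.next
    return values


# Two character tokens: A (category code 11) and 1 (category code 12).
chain = build_list([2881, 3121])
print(walk(chain))  # [2881, 3121]
```

Reading a token list is then just a matter of starting at the first node and following the next pointers until the null pointer is reached.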

Reminder: the 4 parts of a macro definition

As discussed in Part 4, the structure of any macro can be written as:

<TeX macro primitive><macro name><parameter text>{<replacement text>}

where:

<TeX macro primitive> = one of \def, \edef, \gdef or \xdef;
<macro name> = the name of your macro, such as \foo;
<parameter text> can be “null” (not present) or it can be a string of delimiter tokens and macro parameter tokens;
<replacement text> is the actual body of your macro: the section that is “executed” when you call the macro.

NOTE: As also observed in Part 4, throughout the discussion we assume that <macro name> is followed by a space character of category code 10, which acts as a delimiter to terminate the <macro name>. We have not explicitly shown that space character in our discussion but we assume it is there. Strictly speaking, we should represent it like this:

<TeX macro primitive><macro name><space><parameter text>{<replacement text>}

However, we will omit explicit inclusion of a <space> character and implicitly assume its presence.

NOTE: The characters { and } do not become part of the macro token list: their purpose is simply to tell TeX’s input scanner (which creates tokens) where the <replacement text> starts and stops.

When TeX defines a macro, the sections <parameter text><replacement text> are converted into one long continuous token list—the total number of tokens in that list depends on the complexity of the macro. As we’ve seen, the <parameter text> section has a specific purpose of acting as a “token template” or “blueprint” that TeX uses to pick out the tokens which form the arguments (values) to use with the actual macro: i.e., the tokens to feed into the <replacement text>.

To firm-up these ideas, let’s take an example macro but keep it short so that subsequent diagrams do not become too cluttered:

\def\foo A#1\fake{123 #1}

For our macro \foo:

<parameter text> = A#1\fake
<replacement text> = 123 #1

Although this example is a simple macro, it contains all the features we need to explore.

As noted, TeX will convert <parameter text><replacement text> into one long token list which you can see in the diagram below. In our example, the tokens formed from A#1\fake{123 #1} have been converted to a consecutive sequence of tokens stored in a token list (as a linked list of nodes).

Graphic showing a real macro token list

The following diagram, showing how the macro \def\foo A#1\fake{123 #1} is stored, uses real data from inside a TeX engine. It was created using a customized version of Knuth’s TeX that was modified with additional code to intercept macro calls, examine TeX’s internal data and export it in a format suitable for processing with an open-source graphics program called Graphviz.

You can download the following graphic as a PDF file (675 KB) or SVG file (1.8 MB).

Diagram of an annotated TeX token list

Understanding the nodes

Within the diagram above you’ll see that each node contains two data items called the next node and the current node. These are just integer values that represent memory locations inside TeX: locations where nodes are stored. The actual values of next node and current node are not important; they simply record the locations (memory addresses) that allow nodes to be linked together in a list.

The meaning of next node and current node

Back to the example

In the node diagram, the token list formed from A#1\fake{123 #1} contains several “special tokens” introduced at the start of this article. In addition, the node list representing our macro starts with a “special first node”: we’ll explore what these are and what they do.

The very first item in a macro token list (and some other token list types) does not store a token value but a data item called the macro’s reference count which TeX uses to track the use of the macro.

A reference count node is the first one in a token list

The first token of the <parameter text> is stored in the node that follows immediately after the reference count: you can see it is a token representing the letter A with category code 11. From discussions in Parts 2 and 3 we know that a character token is calculated using

\[\text{token value}=256\times \text{category code} + \text{character code}\]

which, for a letter A with category code 11, is

\[\text{token value}=256\times 11 + 65\]

giving the value 2881, as shown in the node.
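
The same arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes 8-bit character codes (0 to 255), which is what makes the multiplier 256 work.

```python
from typing import Tuple


def char_token(category_code: int, char: str) -> int:
    """Token value = 256 * category code + character code."""
    return 256 * category_code + ord(char)


def split_char_token(token: int) -> Tuple[int, int]:
    """Recover (category code, character code) from a character token value."""
    return divmod(token, 256)


print(char_token(11, "A"))     # 2881
print(split_char_token(2881))  # (11, 65)
```

Because the character code always fits below 256, the division in split_char_token reverses the calculation exactly.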

The “command” `\fake` used in `\foo`

Within our macro definition \def\foo A#1\fake{123 #1} one of the delimiters is an undefined command \fake which is stored within the token list as part of the <parameter text> section. As you can see, within the overall macro token list \fake is a token whose value is 19491—an integer value calculated by TeX using the formula discussed in Part 3. When TeX attempts to execute \foo it will expect to find the \fake token value at the end of the <parameter text> section. TeX will not try to execute the \fake command because its role is merely to provide a form of “punctuation” within the <parameter text> “token template”.

Using the \fake command token as a macro delimiter
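
The Part 3 formula is not repeated in this article. Assuming it is the one used in Knuth’s TeX, where a command token equals 4095 plus the location of the command’s entry in TeX’s internal tables, the following sketch recovers that location from the value shown in the diagram. The constant and function names are hypothetical, and the offset 4095 is an assumption that should be checked against Part 3.

```python
CS_TOKEN_FLAG = 4095  # assumed offset that Knuth's TeX adds for command tokens


def is_control_sequence(token: int) -> bool:
    """Command tokens sit at or above the offset; character tokens sit below it."""
    return token >= CS_TOKEN_FLAG


def cs_location(token: int) -> int:
    """Internal table location encoded in a command token (under the assumption above)."""
    if not is_control_sequence(token):
        raise ValueError("not a command token")
    return token - CS_TOKEN_FLAG


fake = 19491  # token value of \fake, as shown in the diagram
print(is_control_sequence(fake), cs_location(fake))  # True 15396
print(is_control_sequence(2881))                     # False: 2881 is the letter A
```

Under that assumption, a single integer comparison is enough to distinguish a command token such as \fake from a character token such as the letter A.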

Special tokens in the `<parameter text>` token list

The “end match” token

When calling a macro, TeX’s first task is to scan the macro call as typed by the user and compare the tokens present in the user’s <parameter text> section to the tokens contained within the template <parameter text> stored in memory (created at the time the macro was defined). Because the macro’s full definition, constructed from <parameter text><replacement text>, is stored as one long consecutive list of tokens, TeX needs to know where, in that token list, <parameter text> stops and <replacement text> starts. To achieve this, when TeX is defining the macro (building the token list) it inserts a special terminator token, called an end match token, as the very last token in the set of tokens generated from <parameter text>. The end match token cannot be generated by user commands; only TeX itself can create it, so TeX is certain to detect the end of the <parameter text>.

Showing the end match token in a TeX token list

Here, we can see that the first token after end match is a token representing the digit 1 with category code 12. This is to be expected because the <replacement text> for our macro \foo is 123 #1: it starts with the token representing the digit 1 (with category code 12).

From the discussion in Parts 2 and 3 we know that a character token is calculated using

\[\text{token value}=256\times \text{category code} + \text{character code}\]

which, for a digit 1 with category code 12 is

\[\text{token value}=256\times 12 + 49\]

giving the token value 3121, as shown in the node.
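
To illustrate the role of the end match token, here is a minimal sketch that models the stored definition of \foo as a flat Python list and locates the boundary between the two sections. The names END_MATCH, MATCH_1 and OUT_1 are hypothetical stand-ins for TeX’s special tokens (their real integer values appear in the diagram), and the reference count node is omitted. TeX does not literally split the list; it simply walks the nodes until it meets end match.

```python
# Hypothetical stand-ins for the special tokens that only TeX can create.
MATCH_1 = "match parameter"
END_MATCH = "end match"
OUT_1 = "output parameter 1"

# Stored tokens for \def\foo A#1\fake{123 #1}.
# Character tokens are 256 * category code + character code; 19491 is \fake.
foo_tokens = [
    2881,        # A (category code 11)
    MATCH_1,     # #1 in the parameter text
    19491,       # \fake
    END_MATCH,   # boundary inserted by TeX
    3121, 3122, 3123,  # digits 1, 2, 3 (category code 12)
    2592,        # space (category code 10, character code 32)
    OUT_1,       # #1 in the replacement text
]


def split_definition(tokens):
    """Return (parameter text tokens, replacement text tokens)."""
    boundary = tokens.index(END_MATCH)
    return tokens[:boundary], tokens[boundary + 1:]


parameter_text, replacement_text = split_definition(foo_tokens)
print(parameter_text)       # [2881, 'match parameter', 19491]
print(replacement_text[0])  # 3121
```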

“match parameter” tokens

When TeX stores the macro definition, it converts each parameter token (#1, #2, … #9) within <parameter text> into a special token called a match parameter token. These tokens tell TeX where, within the user’s macro call, it needs to start collecting the tokens that form the macro’s arguments.

Showing the match parameter token in a TeX token list
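
The template-matching idea can be sketched as follows. This is a deliberately simplified model, not TeX’s actual algorithm: tokens are represented as Python strings, PARAM is a hypothetical stand-in for a match parameter token, every parameter is assumed to be followed by a single delimiter token, and brace groups receive no special treatment.

```python
PARAM = object()  # hypothetical stand-in for a match parameter token


def match_call(template, call_tokens):
    """Match call_tokens against template; return (arguments, tokens consumed)."""
    args = []
    pos = 0  # position in the user's call
    i = 0    # position in the stored template
    while i < len(template):
        item = template[i]
        if item is PARAM:
            # Collect everything up to the delimiter that follows the parameter.
            delimiter = template[i + 1]
            end = call_tokens.index(delimiter, pos)  # ValueError if it is missing
            args.append(call_tokens[pos:end])
            pos = end + 1
            i += 2
        else:
            # An ordinary template token must appear literally in the call.
            if pos >= len(call_tokens) or call_tokens[pos] != item:
                raise ValueError("use of macro doesn't match its definition")
            pos += 1
            i += 1
    return args, pos


# Template for \def\foo A#1\fake{...} and the call \foo Axy\fake z
template = ["A", PARAM, "\\fake"]
call = ["A", "x", "y", "\\fake", "z"]
print(match_call(template, call))  # ([['x', 'y']], 4)
```

In this model the tokens x and y become argument 1, and the trailing z is left in the input because it lies beyond the \fake delimiter.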

Special tokens in the `<replacement text>` token list

“output parameter” tokens

When TeX has finished matching the call and is ready to actually run (expand) the macro, the output parameter tokens tell TeX where, within the <replacement text>, it needs to feed in the tokens representing the arguments provided by the user when the macro was called. In effect, each one says “At this location, insert the tokens representing the user’s argument n, where n=1...9”.

Within the <replacement text> section of the stored macro-definition token list there will be an output parameter token corresponding to each #1, #2... #9 present in the original definition.

Showing the output parameter token in a TeX token list

If we look at our definition of \foo (\def\foo A#1\fake{123 #1}) we see there is only one macro parameter (#1) in the <parameter text> (A#1\fake) and it appears just once in the <replacement text> (123 #1): this results in just one output parameter token in the token list representing the <replacement text>.

Note the following in the node list representing \foo’s <replacement text>:

the token immediately before the output parameter token represents a space character (category code 10, character code 32) because there is a space between the 123 and the macro parameter (#1) in the original definition of \foo;
the output parameter is the last token in the list, so its next node value is the special value “null” (meaning “empty”), which terminates the list: there are no more nodes after output parameter because it is the final token, marking the end of the <replacement text> and thus the end of the macro definition.
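
The substitution step can be sketched in Python as follows. The OutParam class and the substitute function are hypothetical names used only for illustration, and tokens are modelled as plain strings rather than TeX’s integer token values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutParam:
    """Stand-in for an output parameter token: 'insert argument number n here'."""
    number: int


def substitute(replacement_text, arguments):
    """Copy the replacement text, swapping each OutParam for its argument tokens."""
    result = []
    for token in replacement_text:
        if isinstance(token, OutParam):
            result.extend(arguments[token.number - 1])  # arguments are numbered from 1
        else:
            result.append(token)
    return result


# Replacement text of \foo: 123 #1 (note the space token before the parameter).
foo_replacement = ["1", "2", "3", " ", OutParam(1)]
print(substitute(foo_replacement, [["x", "y"]]))
# ['1', '2', '3', ' ', 'x', 'y']
```

If the same parameter appeared several times in the replacement text, the stored list would contain one output parameter token per occurrence, and each would be replaced by the same argument tokens.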

Part 6

In Part 6 we use some detailed graphics to explain and explore the exact meaning of macro expansion and the consequences of TeX’s tokenization of macro arguments prior to feeding them into a macro’s <replacement text>.