How does \expandafter work: An introduction to TeX tokens

Background for `\expandafter`: TeX tokens and token lists

As a first step towards understanding how \expandafter really works, we’ll take a look at two components of TeX that are fundamental to the operation of \expandafter: TeX tokens (integer numbers) and token lists (lists of integers). Readers who would like to explore those topics in much more detail may be interested to read the following articles published by Overleaf:

How do TeX macros actually work?

Where did the token data come from?

Throughout this article we use actual token values calculated by TeX, data that is not usually accessible to users. For readers curious to know how this token-value data was obtained, Overleaf has custom builds of several TeX engines which we use for research. Those engines are modified to output information on TeX’s inner processing activities, which provides additional background material for some of the articles we produce. By showing and discussing numerical token values, we hope to help readers better understand “TeX tokens” and make this important concept feel a little less opaque.

TeX Tokens 101 (and notions of expansion)

When TeX processes your input file it reads the text and converts individual characters and sequences of characters (commands) into so-called tokens. A TeX token is simply an integer value, calculated by TeX, which is used to “encode” data TeX needs to store about an item read-in from its current input source. Think of tokens as small parcels of information which “package together” data that TeX needs to record, ready for passing on to the next stage of processing. Internally, TeX operates on those integer token values—it does not use the actual letters, symbols, digits etc. originally contained in your input file: everything is converted to a token (an integer) and TeX works with those.

How TeX calculates token values

Here we look at token calculations used in Knuth’s original TeX, e-TeX and pdfTeX. Other TeX engines, particularly XeTeX and LuaTeX, need slightly different token calculations to account for their use of Unicode, but the methods are similar to those described below.

Character tokens (non-active characters)

Calculation of token values for non-active characters is straightforward:

\[\text{character token} = 256\times \text{(category code)} + \text{character (ASCII) code}\]

Example: The letter A with category code 11, character code 65 is represented by TeX as the character token value $256\times 11 + 65 = 2881$.
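The calculation above can be written as a one-line function. The following is a minimal Python sketch for illustration only: the function name `char_token` is our own invention, not something TeX provides, and the range checks assume the 8-bit engines discussed in this section.

```python
def char_token(category_code: int, char_code: int) -> int:
    """Pack a (category code, character code) pair into one integer,
    using the formula 256 * (category code) + (character code)."""
    if not 0 <= category_code <= 15:
        raise ValueError("category codes run from 0 to 15")
    if not 0 <= char_code <= 255:
        raise ValueError("this sketch assumes 8-bit character codes")
    return 256 * category_code + char_code


# The letter A: category code 11, character code 65.
print(char_token(11, ord("A")))  # 2881
```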

You might encounter descriptions in TeX literature noting that once TeX has input a character, its category code value becomes “permanently bound” to that character: the above token value calculation shows why that is true. However, later in TeX’s processing it can, and does, “unpackage” character tokens to reveal the constituent (character code, category code) pair from which the token was constructed—when TeX does that “unpackaging” it still won’t alter that character’s category code, it merely uses that information during its subsequent processing.
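The “unpackaging” step is simply the packing formula run in reverse. Here is a small Python sketch of the idea; the name `unpack_char_token` is hypothetical and the code only models the arithmetic, not TeX’s actual implementation.

```python
def unpack_char_token(token: int) -> tuple[int, int]:
    """Split a character token into (category code, character code)."""
    if not 0 <= token <= 4095:
        raise ValueError("not a character token")
    # Integer division recovers the category code, the remainder
    # recovers the character code.
    category_code, char_code = divmod(token, 256)
    return category_code, char_code


category_code, char_code = unpack_char_token(2881)
print(category_code, chr(char_code))  # 11 A
```

Note that unpacking only reads the two values back out: nothing in this step changes the category code that was stored when the token was created.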

Command tokens

TeX’s input processing and token generation recognize two types of command:

commands constructed from one or more characters that have category code 11;
single-character commands where that character’s category code is not 11: such as \$ or \#.

In both cases, TeX excludes the leading \ character and uses the character code of each remaining character to calculate an integer that TeX calls curcs (current control sequence). TeX then uses the value of curcs to calculate a token value for the command.

Commands made from characters with category code 11

Suppose our command (minus the leading \ character) is composed of a sequence of characters: $\mathrm{C_1C_2C_3...C_N}$ where $\mathrm{C}_i$ is the character code of each character—e.g., the character code of A is 65. TeX uses all of the character codes $\mathrm{C}_i$ to calculate the integer curcs (using a hash function). Once TeX has calculated the value of curcs it simply adds 4095 to that value, to give the token value:

\[\text{command token} = \text{curcs} + 4095\]

Note that the variable curcs plays an extremely important role in TeX’s inner processing activities.
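For readers curious about the hash function mentioned above, here is a Python sketch modelled on the hashing loop in Knuth’s published TeX source (tex.web). Two constants are assumptions on our part: `HASH_BASE = 514` is the first table slot used for multi-letter names in tex.web, and `HASH_PRIME = 8501` is the value we believe Web2C-based builds use (Knuth’s own listing uses 1777). The function names are hypothetical. The sketch also ignores collisions: when two names hash to the same slot TeX stores one of them elsewhere in the table, so this code will not reproduce the curcs of every command.

```python
HASH_BASE = 514    # assumption: first hash-table slot, as in tex.web
HASH_PRIME = 8501  # assumption: build-dependent; tex.web itself uses 1777


def curcs_for_name(name: str) -> int:
    """Home slot for a multi-letter command name, ignoring collisions."""
    if len(name) < 2:
        # Assumption: one-character names use the simpler
        # single-character rule rather than the hash function.
        raise ValueError("this sketch covers names of two or more letters")
    codes = [ord(c) for c in name]
    h = codes[0]
    for code in codes[1:]:
        h = h + h + code
        while h >= HASH_PRIME:
            h -= HASH_PRIME
    return h + HASH_BASE


def command_token(name: str) -> int:
    """Apply the article's formula: command token = curcs + 4095."""
    return curcs_for_name(name) + 4095


print(curcs_for_name("hello"), command_token("hello"))  # 3745 7840
```

Under these assumptions the sketch gives a curcs of 3745 for the name hello, which agrees with the value quoted for \hello later in this article.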

Single-character commands

Tokens that represent commands such as \$ and \# are subject to a slightly different calculation; the integer curcs is now given by the simpler formula:

\[\text{curcs} = 257 + \text{character (ASCII) code}\]

For example, with \$, $\text{curcs}=257 + 36 = 293$. TeX again adds 4095 to this value (using $\text{command token} = \text{curcs} + 4095$) resulting in \$ having a token value $293 + 4095 = 4388$.

Compared to commands composed of characters with category code 11, the only difference here is the way that TeX calculates the value for curcs.
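As a quick check of the arithmetic, the two-step calculation for single-character commands can be sketched in Python. The function name is hypothetical and the code assumes 8-bit character codes.

```python
def single_char_command_token(char: str) -> int:
    """Token for a one-character command: curcs = 257 + character code,
    then command token = curcs + 4095."""
    if len(char) != 1 or ord(char) > 255:
        raise ValueError("expected a single 8-bit character")
    curcs = 257 + ord(char)
    return curcs + 4095


for char in "$#":
    print(f"\\{char} -> {single_char_command_token(char)}")
# \$ -> 4388
# \# -> 4387
```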

Note: the integer curcs is not calculated for character tokens: it is always set to 0 when TeX is creating, or working with, character tokens.

Active-character tokens

TeX has the concept of so-called active characters: any character that has been assigned category code 13. Tokens for this special class of characters are calculated differently from those for regular characters.

The active-character mechanism allows TeX to create what are, in effect, single-character macros that you can use without having to prefix the active character with an escape character (typically \): the isolated character will trigger its macro behaviour. The canonical example is the tilde character (~) that TeX/LaTeX use for non-breaking spaces, which can be defined/enabled as follows:

\catcode`~=13 % assign category code 13 to ~
\def~{\penalty10000\ } % define ~ to act as a macro

When TeX subsequently reads a ~ character it will detect its category code is 13 and process it as a “mini macro”. To calculate a token representing an active character TeX applies another variation for calculating curcs:

\[ \begin{align*} \text{curcs} &= \text{character code} + 1\\ \text{active character token} &= \text{curcs} + 4095\\ \end{align*} \]

For example, the ~ character has character code 126, meaning its active-character token value representation is calculated as follows:

\[ \begin{align*} \text{curcs} &= 126 + 1\\ \text{active character token} &= 127 + 4095\\ &=4222\\ \end{align*} \]

Note that, like commands, tokens representing active characters are > 4095.
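To see how much the category code matters, the following Python sketch computes the token for ~ twice: once as an active character and once as an ordinary character with category code 12. The function name and the constant `OFFSET` are our own labels for the quantities in the formulas above.

```python
OFFSET = 4095  # added to curcs to form command and active-character tokens


def active_char_token(char_code: int) -> int:
    """Token for an active character: curcs = character code + 1."""
    curcs = char_code + 1
    return curcs + OFFSET


tilde = ord("~")  # character code 126
print(active_char_token(tilde))  # 4222: ~ with category code 13
print(256 * 12 + tilde)          # 3198: ~ as a character with category code 12
```

The same character therefore lands on either side of the 4095 boundary depending on its category code at the moment TeX reads it.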

Consequences/notes

Any token whose value exceeds 4095 is immediately identifiable as a command token—hence TeX can very easily detect whether a particular token represents a character or a command.
For any token value, TeX can, when it needs to, “unpackage” that token to reveal the character (and its category code), or the command, originally present in your .tex file, stored in a macro definition or contained in some other token list.
The “intermediate” quantity called curcs—that TeX uses to calculate command token values—plays an important role in TeX’s low-level processing. curcs acts as an “index value” that TeX uses to store/lookup the current meaning of a command. Given any command token, $\mathrm{T}$, TeX simply subtracts 4095 to access the value of curcs: \[\text{curcs} = \mathrm{T}-4095\]

Incidentally, TeX does store the human-readable string of characters from which a command token is generated—this is essential for error reporting and other commands such as \string whose expansion is the human-readable version of a token value. However, those human-readable strings of characters stored inside TeX are only used/output when requested: for all other processing the token integer value is used.
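The points above can be pulled together in a small decoder. This Python sketch is only a model: the curcs ranges it tests (1 to 256 for active characters, 257 to 512 for single-character commands) are inferred from the formulas in this article for 8-bit character codes, and the function name is hypothetical. It cannot recover the name of a multi-letter command because that requires the stored name strings, which the sketch does not model.

```python
def describe_token(token: int) -> str:
    """Say what kind of item a token value represents."""
    if token <= 4095:
        category_code, char_code = divmod(token, 256)
        return f"character {chr(char_code)!r} with category code {category_code}"
    curcs = token - 4095  # recover the lookup index
    if 1 <= curcs <= 256:
        return f"active character {chr(curcs - 1)!r} (curcs {curcs})"
    if 257 <= curcs <= 512:
        return f"single-character command \\{chr(curcs - 257)} (curcs {curcs})"
    return f"other command (curcs {curcs})"


for token in (2881, 4222, 4388, 7840):
    print(token, "->", describe_token(token))
```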

Looking at some real tokens

Just to make the notion of tokens feel a little less opaque, we’ll define the following simple macro and take a look at the tokens TeX produces:

\def\hello{Greetings, from \TeX. \hskip 10pt}

For the \hello macro, TeX uses the characters h, e, l, l, o to calculate a value of 3745 for curcs; TeX then adds 4095 to create a token value of $3745 + 4095 = 7840$ (for Knuth’s TeX, e-TeX or pdfTeX).

After creating a token to represent \hello, the \def command causes TeX to read the subsequent tokens and use them to create a token list which is stored as the definition of the \hello command. That stored definition (token list) can then be retrieved whenever you tell TeX to use the \hello command.

The following table lists the actual token values created for each item (character, macro or primitive) contained in the \hello macro definition. This list of tokens (integers) is what TeX stores in its memory, as a data structure known as a linked list. Readers wishing to understand token lists in more detail are referred to the Overleaf article What is a TeX token list?

TeX token value	Item represented
2887	G
2930	r
2917	e
2917	e
2932	t
2921	i
2926	n
2919	g
2931	s
3116	,
2592	<space>
2918	f
2930	r
2927	o
2925	m
2592	<space>
5235	\TeX
3118	.
2592	<space>
7943	\hskip
3121	1
3120	0
2928	p
2932	t

In the token list above, the characters have category codes of 10, 11 or 12. For example:

<space> characters have category code 10 and character code 32, giving a token value of $256\times 10 + 32 = 2592$
, and . have category code 12 and character codes 44 and 46 respectively, giving tokens:

token for , $= 256 \times 12 + 44 = 3116$
token for . $= 256\times 12+ 46 = 3118$
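The character rows of the table can be reproduced with a few lines of Python. This is a deliberately simplified sketch: the `catcode` function is hypothetical and knows only three category codes (space 10, ASCII letters 11, everything else 12), so it is not suitable for input containing \, {, }, % or other special characters, and it does not model TeX’s handling of spaces.

```python
def catcode(char: str) -> int:
    """Simplified category codes: space 10, letters 11, everything else 12."""
    if char == " ":
        return 10
    if char.isascii() and char.isalpha():
        return 11
    return 12


def char_tokens(text: str) -> list[int]:
    """Character tokens for plain text, using 256 * catcode + char code."""
    return [256 * catcode(c) + ord(c) for c in text]


print(char_tokens("Greetings, from "))
print(char_tokens("10pt"))  # [3121, 3120, 2928, 2932]
```

The first call yields the sixteen values at the top of the table, from 2887 (G) down to the second 2592 (space).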

Whenever TeX subsequently encounters the token value 7840 (representing \hello) it can, if required, “unpackage” that token to extract curcs through the simple calculation $\text{curcs} = \text{token value} - 4095$ (see above). Using the value of curcs TeX can consult its inner data tables to determine that command token 7840 represents a macro command. In addition, again via curcs, TeX can also look-up and retrieve the stored definition of \hello.

When TeX needs to fully process token 7840, i.e., to run the \hello macro, it no longer needs token 7840: that token has done its job—i.e., it triggered TeX to run the macro \hello. TeX can now discard token 7840 and fetch the tokens which represent the definition (token list) stored in memory. In effect, the \hello macro command (token 7840) has been removed from TeX’s current input source and replaced by tokens contained in the definition of \hello. What we have just described is one form of token expansion.
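The replacement step just described can be modelled with a queue of pending tokens. The Python sketch below is a simplification under several assumptions: the names `DEFINITIONS` and `expand_next` are hypothetical, only a parameterless macro is handled, and the stored meaning is indexed directly by curcs using the token values given in this article.

```python
from collections import deque

# Stored meanings indexed by curcs (token values taken from this article).
DEFINITIONS = {
    3745: [2887, 2930, 2917, 2917, 2932, 2921, 2926, 2919, 2931, 3116,
           2592, 2918, 2930, 2927, 2925, 2592, 5235, 3118, 2592, 7943,
           3121, 3120, 2928, 2932],  # \hello
}


def expand_next(pending: deque) -> bool:
    """Expand the first pending token if it is a macro this sketch knows."""
    if not pending or pending[0] <= 4095:
        return False               # empty input, or a character token
    curcs = pending[0] - 4095      # recover the lookup index
    if curcs not in DEFINITIONS:
        return False               # a command this sketch does not model
    pending.popleft()              # the macro token has done its job
    pending.extendleft(reversed(DEFINITIONS[curcs]))
    return True


pending = deque([7840, 3105])      # \hello followed by ! (256*12 + 33)
expand_next(pending)
print(len(pending), pending[0])    # 25 2887
```

After the call, token 7840 is gone and the 24 tokens of its definition sit at the front of the queue, ahead of the token for the exclamation mark.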

The \TeX command (token value 5235 listed above) used within \hello is itself a macro constructed from more tokens—so its definition is also stored as a token list:

TeX token value	Item represented
2900	T
19598	\kern
3117	-
3118	.
3121	1
3126	6
3126	6
3127	7
2917	e
2925	m
19597	\lower
3118	.
3125	5
2917	e
2936	x
6175	\hbox
379	{
2885	E
637	}
19598	\kern
3117	-
3118	.
3121	1
3122	2
3125	5
2917	e
2925	m
2904	X

If we were to replace the \hello command with the complete list of tokens from which it is built, including the \TeX macro, it would be a rather long list—i.e., if we also expanded the \TeX macro we would see:

[Figure: a list of tokens stored in a TeX macro]

Essentially, the single token value 7840 (for \hello) would, when fully expanded, produce a total of 51 tokens (integers) representing characters and primitive commands. In the following list the character or command represented by each token is enclosed in parentheses “(...)”; these annotations are not stored in TeX’s token lists and are shown only to assist the reader:

2887 (G), 2930 (r), 2917 (e), 2917 (e), 2932 (t), 2921 (i), 2926 (n), 2919 (g), 2931 (s), 3116 (,), 2592 (<space>), 2918 (f), 2930 (r), 2927 (o), 2925 (m), 2592 (<space>),  2900 (T), 19598 (\kern), 3117 (-), 3118 (.), 3121 (1), 3126 (6), 3126 (6), 3127 (7), 2917 (e), 2925 (m), 19597 (\lower), 3118 (.), 3125 (5), 2917 (e), 2936 (x), 6175 (\hbox), 379 ({), 2885 (E), 637 (}), 19598 (\kern), 3117 (-), 3118 (.), 3121 (1), 3122 (2), 3125 (5), 2917 (e), 2925 (m), 2904 (X), 3118 (.), 2592 (<space>), 7943 (\hskip), 3121 (1), 3120 (0), 2928 (p), 2932 (t)

To a human reader this is just a series of integers but to TeX it encodes a great deal of information.
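The count of 51 can be checked with a short recursive sketch in Python. It is a model, not a description of TeX’s algorithm: real TeX expands tokens on demand as it reads them rather than all at once, and the sketch treats every token without an entry in the (hypothetical) `DEFINITIONS` table as final. The token values are those listed in the two tables above.

```python
DEFINITIONS = {
    7840: [2887, 2930, 2917, 2917, 2932, 2921, 2926, 2919, 2931, 3116,
           2592, 2918, 2930, 2927, 2925, 2592, 5235, 3118, 2592, 7943,
           3121, 3120, 2928, 2932],              # \hello (24 tokens)
    5235: [2900, 19598, 3117, 3118, 3121, 3126, 3126, 3127, 2917, 2925,
           19597, 3118, 3125, 2917, 2936, 6175, 379, 2885, 637, 19598,
           3117, 3118, 3121, 3122, 3125, 2917, 2925, 2904],  # \TeX (28 tokens)
}


def fully_expand(tokens: list[int]) -> list[int]:
    """Replace every macro token by its definition, recursively."""
    result = []
    for token in tokens:
        if token in DEFINITIONS:
            result.extend(fully_expand(DEFINITIONS[token]))
        else:
            result.append(token)   # character or non-macro command
    return result


expanded = fully_expand([7840])
print(len(expanded))  # 51
```

The arithmetic is 24 tokens for \hello, minus the one \TeX token, plus the 28 tokens of the \TeX definition.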

Read tokens now and save them for later

As TeX reads your input there may be times when it needs (or is instructed) to delay fully processing some particular set of tokens. If directed to do so, TeX will, until it is told to stop, continue to create tokens from the input but store them for use later on—subsequently retrieving and processing them as part of its typesetting activities. Those stored tokens are saved as so-called token lists which are, in effect, TeX’s only (internal) token-data storage mechanism.

We’ve already seen examples of token lists in the \hello and \TeX macros listed above: the definitions of those macros are stored in TeX’s memory as lists of tokens. TeX will only process (act on) such token lists when you call those macros. Remember too that each token (integer value) encodes sufficient information for TeX to easily work out whether each token stored in a macro definition represents a character or a command.
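Earlier we noted that TeX holds a token list in memory as a linked list. As a purely conceptual illustration, here is a Python sketch of a singly linked list of token values. The class and function names are hypothetical, and TeX itself uses integer indices into a large memory array rather than objects like these.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    token: int                      # the token value stored in this node
    next: Optional["Node"] = None   # link to the following node, if any


def build_token_list(tokens: list[int]) -> Optional[Node]:
    """Chain token values together, preserving their order."""
    head = None
    for token in reversed(tokens):
        head = Node(token, head)
    return head


def walk(head: Optional[Node]) -> Iterator[int]:
    """Follow the links from the head, yielding each token in turn."""
    node = head
    while node is not None:
        yield node.token
        node = node.next


stored = build_token_list([3121, 3120, 2928, 2932])  # the characters 10pt
print(list(walk(stored)))  # [3121, 3120, 2928, 2932]
```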

Saving tokens with token registers

Another example of token storage is the explicit creation of lists of tokens that are saved in so-called token registers: dedicated internal storage areas that TeX provides for users to store token lists. The TeX primitive \toksdef is one way to use token registers; for example, to use token register 100 and reference it using the command \mylist:

        \toksdef\mylist=100
        \mylist={some \TeX{} tokens here}

\mylist is, in effect, just a name that you assign to a list of tokens stored in register location 100. Similar to a macro definition, \mylist contains the following token list:

TeX token value	Item represented
2931	s
2927	o
2925	m
2917	e
2592	<space>
5235	\TeX
379	{
637	}
2592	<space>
2932	t
2927	o
2923	k
2917	e
2926	n
2931	s
2592	<space>
2920	h
2917	e
2930	r
2917	e

Note: to terminate the \TeX macro and prevent it from absorbing the following <space> character we used a pair of braces {} immediately after \TeX; the tokens for { (379) and } (637) are stored in the token list. Another option is to use a “control space” token \<space>, which would appear in the token list as shown below (the row with token value 4384):

TeX token value	Item represented
2931	s
2927	o
2925	m
2917	e
2592	<space>
5235	\TeX
4384	\<space>
2932	t
2927	o
2923	k
2917	e
2926	n
2931	s
2592	<space>
2920	h
2917	e
2930	r
2917	e

Note that the <space> character is represented as a character token with value $256\times 10 + 32 = 2592$, but \<space> is treated as a single-character command token (value 4384), calculated using the formulae given above:

\[ \begin{align*} \text{curcs} &= 257 + \text{character (ASCII) code}\\ &= 257 + 32\\ &= 289\\ \text{command token for } \backslash\langle\text{space}\rangle &= \text{curcs} + 4095\\ &= 289 + 4095\\ &= 4384\\ \end{align*} \]

In essence \mylist={some \TeX{} tokens here} says to TeX: please scan my input file to convert the following characters/commands to tokens and save them for use later on. TeX will oblige and store those tokens in a memory location you can access by writing \the\mylist, instructing TeX to insert a copy of the tokens contained in token register \mylist. TeX engines include a number of primitive commands that explicitly generate and store token lists—such as \everyjob, \everypar, \mark, and many others.
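The register mechanism described in this section can be modelled in a few lines of Python. Everything here is a sketch under stated assumptions: the class `TokenRegisters` and its method names are hypothetical, the default of 256 registers reflects Knuth’s original TeX (other engines provide more), and the token values are those from the first \mylist table above.

```python
class TokenRegisters:
    """A toy model of numbered token registers with user-chosen names."""

    def __init__(self, count: int = 256):
        self.registers = [[] for _ in range(count)]
        self.names = {}

    def toksdef(self, name: str, index: int) -> None:
        """Model of \\toksdef: make a name refer to register number index."""
        self.names[name] = index

    def assign(self, name: str, tokens: list[int]) -> None:
        """Model of an assignment such as \\mylist={...}: store a token list."""
        self.registers[self.names[name]] = list(tokens)

    def the(self, name: str) -> list[int]:
        """Model of \\the\\mylist: return a copy of the stored tokens."""
        return list(self.registers[self.names[name]])


regs = TokenRegisters()
regs.toksdef("mylist", 100)
regs.assign("mylist", [2931, 2927, 2925, 2917, 2592, 5235, 379, 637, 2592,
                       2932, 2927, 2923, 2917, 2926, 2931, 2592,
                       2920, 2917, 2930, 2917])
copy = regs.the("mylist")
print(len(copy))  # 20
```

Returning a copy mirrors the behaviour described above: inserting the tokens into the input leaves the contents of the register unchanged.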