How TeX macros actually work: Part 3

Time to pause!

Before moving on to the next part of this story we need to have a recap: remind ourselves of where we are going and gather our thoughts to make sure all the key ideas are in place. Just a reminder, our worked example is based on the assumption that TeX has read a line of text containing Hello World \jobname and that TeX is typesetting this to build a paragraph.

The ultimate objective

Our goal is to develop a better (deeper) understanding of the nature of TeX macros and how they work. However, to get there we first need to understand how TeX reads an input file and processes the characters within it. Here is a summary of the topics covered so far.

TeX reads (scans) each character in your input and, for every character, TeX has two pieces of information:

character code: an integer number used to identify that character, e.g., when stored in a .tex input file;
category code: another integer, internal to TeX, that it uses to assign meaning to each character read from the input.

As soon as TeX reads a character, that character’s category code becomes permanently associated with it through the creation of a character token:

TeX uses a simple formula to “package together” a character code and its corresponding category code into an integer called a character token.
You can change the meaning of any character which TeX has not yet read-in by assigning a different category code to any character whose behaviour you wish to change—i.e., modify the way TeX treats that character.
Re-defining (re-mapping) category codes is achieved using TeX’s primitive \catcode command.
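
The packaging step can be sketched in a few lines of Python. This is an illustrative model only, not TeX's source code: the function names and the tiny category-code table are hypothetical, and the factor of 256 assumes an 8-bit TeX engine. It shows that a token created before a category-code change keeps the category code it was given when it was read.

```python
# Illustrative model of character tokens in an 8-bit TeX engine.
# All names here are hypothetical, chosen for this sketch.

def make_char_token(catcode, charcode):
    """Package a category code and a character code into one integer."""
    return 256 * catcode + charcode

def split_char_token(token):
    """Recover (category code, character code) from a character token."""
    return divmod(token, 256)

# A tiny category-code table: 11 is "letter", 12 is "other".
catcodes = {ord("H"): 11, ord("@"): 12}

before = make_char_token(catcodes[ord("@")], ord("@"))  # '@' read as "other"
catcodes[ord("@")] = 11                                 # models \catcode`\@=11
after = make_char_token(catcodes[ord("@")], ord("@"))   # '@' read as a letter

print(before, after)            # 3136 2880: the earlier token is unchanged
print(split_char_token(after))  # (11, 64)
```

Because the category code is baked into the integer, changing the table afterwards has no effect on tokens that already exist.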

When TeX sees a character with category code 0 (an escape character) it will switch into a special scanning mode and start to look for a command: either a (potentially) multi-letter control word or a single-character control symbol.

So far, we have looked at TeX’s input-scanning process as it identifies individual characters and uses the category code of each character to work out what to do next. Some characters are just ordinary items of text for typesetting (e.g., category code 11) but we’ve also seen space characters (category code 10) and escape characters (category code 0). There are other category codes which, for the sake of brevity, we’ve not looked at, such as category code 1 (“start group”, e.g., {) and category code 2 (“end group”, e.g., }). Each category code plays its own role in TeX’s input scanning and in the subsequent processing by the algorithms inside TeX.

Tokens: a quick review

The concept of “tokens” is central to the way TeX works: you will see “tokens” mentioned or referenced throughout TeX-related books, articles and online communities, so it is worth briefly reviewing this topic. You can find more detail in a previously published article, What is a "TeX token"?

We’ve already seen that TeX converts input characters into tokens by combining the character code and category code into a single composite integer. TeX does something similar for commands: using the name of the command it calculates an integer called a command token (we will explore this in more detail). As a guide, you can think of tokens as TeX’s method for “packaging” items it has read from the input, making them ready for dispatch into the next stage of TeX’s processing. Having all items (characters or commands) neatly wrapped into a single numeric representation makes it easier to process them further down the chain. For example, when TeX wants to store some of your input to use later on, such as a macro definition, it just needs to save that definition, however complex, as a series of integers, where each integer is a token representing a character or a command contained within your macro’s definition.

So, what next?

In the final section of Part 2 we saw how an escape character (category code 0) switches TeX into a special processing mode where it looks for the name of a command. In our example, TeX detected the string of characters jobname and we finished Part 2 at the point where TeX was going to “do something” with that string of characters (name of a command). In this part we will look, in detail, at what TeX does next.

Once TeX has identified that a particular sequence of characters in your input file represents the name of a command (here, jobname) TeX might, depending on what it is doing, need to execute that command. We say “might need to” because there are times when TeX won’t immediately try to execute a command: for example, when it is defining a macro (TeX is building token lists), a topic we will discuss later. However, we’ll continue to follow our example where TeX is typesetting a paragraph and will, in this situation, need to execute \jobname.

From a string of characters to running a command: how?

Firstly, let’s revisit Graphic 5b from Part 2 in which TeX has identified that a particular string of characters within the input constitutes the name of a command: jobname. Graphic 5b indicates that TeX has to “Check internal tables...”. What does that actually mean?

TeX looking for a command name

Another, more detailed, description of how TeX “Checks internal tables” to make the transition from having a string of characters (e.g., jobname) to working out exactly what the command is, and what it means, can be found in a previous article What is a "TeX token"? Here, we’ll summarize the key ideas whilst trying to avoid excessive duplication.

Let’s start with an analogy. Suppose you are reading a book and come across an unfamiliar word: what do you do? Today, it’s almost certainly “reach for Google” but let’s assume you prefer an older method: you reach for a dictionary that lists words and provides their meaning(s). TeX has an analogous mechanism: an internal “dictionary” which lists all the commands currently known to TeX, together with the “meaning” of those commands. By “meaning” we are referring to what type of command it is: what it does, plus any other information TeX might need to run that command. Note too that the term “command” includes any TeX/LaTeX macros written by users/TeX programmers and the hundreds of built-in primitive commands.

Continuing our dictionary analogy: when, as human readers, we need to look up the meaning of a word we search the dictionary using its alphabetical listing of words but, of course, TeX doesn’t quite work like that. Going back to our original jobname example, how does TeX find, within its “dictionary”, the “meaning” of jobname, and what does that “meaning” actually provide to TeX?

Rather than providing an internal “alphabetical listing” of all the commands that TeX knows about, it does something a bit different. TeX converts the entire sequence of characters present in the name of a command into a single integer, which will be used to identify (represent) that command. Internally, TeX maintains a big “dictionary” of all known commands into which it stores the integers calculated from command names. Note that the dictionary doesn’t store the actual command names themselves as sequences of letters (called strings). TeX uses that dictionary for all of its built-in commands (primitives) and it will use it to store details of any macro (command) created by users: the name of your macro is turned into an integer and that integer is “registered” inside TeX’s dictionary.

Each time TeX detects a command used in your input, and needs to know something about that command, it converts the series of characters in the command name to an “equivalent” integer and uses that integer to look-up the command within its “big dictionary”. Programmers among you might like to know that TeX uses a form of hash function to do that conversion.
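
For programmers, the name-to-integer lookup can be sketched as a toy hash table in Python. Everything here is an assumption made for illustration: the table size, the hashing formula, the linear probing and all function names are hypothetical and are not taken from any TeX engine. A real table also needs some way to tell apart two names that happen to hash to the same integer; this sketch keeps the names in a side list purely for that purpose.

```python
# Toy model of a command "dictionary" indexed by integers derived from names.
# Table size, hash formula and probing strategy are assumptions of this sketch.

TABLE_SIZE = 1009                # hypothetical prime table size
names = [None] * TABLE_SIZE      # side list, used only to resolve collisions
meanings = [None] * TABLE_SIZE   # what each registered command "means"

def hash_name(name):
    """Fold the characters of a command name into a single integer."""
    h = 0
    for ch in name:
        h = (2 * h + ord(ch)) % TABLE_SIZE
    return h

def locate(name):
    """Return the slot for `name`, probing linearly past any collisions."""
    slot = hash_name(name)
    while names[slot] is not None and names[slot] != name:
        slot = (slot + 1) % TABLE_SIZE
    return slot

def define(name, meaning):
    """Register a command name and its meaning; return its slot number."""
    slot = locate(name)
    names[slot] = name
    meanings[slot] = meaning
    return slot

def lookup(name):
    """Return the stored meaning, or None if the name is not registered."""
    return meanings[locate(name)]

define("jobname", "primitive: name of the current job")
define("mymacro", "user-defined macro")
print(lookup("jobname"))
print(lookup("undefined"))       # None: nothing registered under that name
```

The integer returned by `define` plays the role of the command's identifying number: later lookups of the same name always arrive at the same slot.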

Diagram of a hash function

Graphic 6: From characters to command meaning

The following graphic shows the journey that a command undergoes as TeX converts the string of characters into an equivalent integer, which it calls curcs, and uses that integer to look up the command’s meaning in TeX’s “big dictionary”. The result of that lookup is two pieces of information: two integers, called curcmd and curchr, that TeX can use to work out exactly what the command does and how to subsequently execute it.

TeX converting a string of characters into an equivalent integer to lookup the command’s meaning

Internally, TeX maintains a variable called curcs (current control sequence) which is used to store the integer value of the command that TeX is currently working on—i.e., curcs stores the integer calculated from the name of the command. That’s not quite the whole story because there is one more detail: if TeX has just read/processed a character, not a command, it will set curcs to a value of 0, to remember that the last thing read-in was a character, not a command.

What commands mean to TeX

If we look at the set of built-in commands provided by TeX engines we can see that some of those commands are closely related: they perform similar tasks. For example, there are 4 primitive commands that all TeX engines use to define (create) macros: \def, \gdef, \edef, \xdef. Those 4 commands all define macros but, of course, each one does it slightly differently. If we think about this from a programming point of view: here we have 4 macro-definition commands that, broadly, do the same thing but we need to select between them in order to cater for their individual behavior.

To deal with this, TeX assigns two values to every command and those two values are what TeX understands as a command’s “meaning” (its role/what it does)—those two values are internal to TeX, deep within the software, and part of the “inner machinery” that is not accessible to users. Every TeX command, whether it is a built-in primitive, or a user-defined command, is assigned two values which, to TeX, define/classify its behaviour—what it means to TeX. When TeX uses its “big dictionary” to look-up a command, it will find those two vital pieces of information:

command code: a sort of “general classification” indicating what “type” of command it is, such as a “macro definition” command (one of \def, \gdef, \edef, \xdef) or a “box making” command (such as \hbox, \vbox or \vtop), and so forth for the hundreds of commands that TeX engines support. Macros (user-defined commands) are also assigned a command code.
command modifier: This is ancillary information that provides TeX with specific information about a command. Macros (user-defined commands) are also assigned a command modifier—though, with macros, the command modifier plays a slightly different role than it does for primitives (for macros, the command modifier indicates where the macro definition is stored in memory).

Taken together, the command code and command modifier uniquely identify each command. Here are the command codes and command modifiers for the macro-definition commands as used by Knuth’s original TeX software—note that other TeX engines may use different values but they follow the exact same principle:

Command	Command code	Command modifier
`\def`	97	0
`\gdef`	97	1
`\edef`	97	2
`\xdef`	97	3
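
To see how one shared command code plus a small modifier can select between related commands, here is a minimal Python sketch. The numeric values are those in the table above for Knuth’s TeX; the dictionary, the function name and the particular way the modifier is decoded (odd means global, 2 or more means expanded) are assumptions made for this illustration.

```python
# Sketch: a command's "meaning" as a (command code, command modifier) pair.
# Values follow the table above; all names are hypothetical.

DEF_CODE = 97

meaning = {
    "def":  (DEF_CODE, 0),
    "gdef": (DEF_CODE, 1),
    "edef": (DEF_CODE, 2),
    "xdef": (DEF_CODE, 3),
}

def describe(name):
    """Explain a macro-definition command from its two stored integers."""
    code, modifier = meaning[name]
    if code != DEF_CODE:
        return "not a macro-definition command"
    is_global = modifier % 2 == 1   # \gdef and \xdef define globally
    expands = modifier >= 2         # \edef and \xdef expand the body first
    scope = "global" if is_global else "local"
    body = "expanded" if expands else "unexpanded"
    return f"{scope} definition, {body} replacement text"

for name in meaning:
    print(name, "->", describe(name))
```

One piece of code can therefore handle all four commands: the command code routes execution to the macro-definition logic and the modifier selects the variant.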

Recap: making sense of all these variables/values

At this point we’re awash with lots of information on values, variables, command values and all sorts of detail—it can quickly become confusing so let’s take stock of what we know. When TeX reads something from your input it is either a character or a command. Whenever TeX reads something from the input it needs to store information about what it has just read (scanned):

For characters: it needs to record the character code and the category code. It also needs to create and store the token value that TeX calculates using those values.
For commands: TeX needs to know the numeric equivalent, curcs, that it calculated from the command name. It might also need to store the “meaning” it retrieved by looking up the command in TeX’s “dictionary”: the command code and the command modifier. On top of this TeX will also need to calculate a token value that represents this command.

Yes, it’s confusing: lots of variables and token ideas floating about, so let’s try to make sense of this.

Internally, TeX uses four global variables to store information about the latest item that TeX has read-in (or is currently “working on”)—we won’t discuss those variables in great detail but knowing of their existence helps to provide a little more background to understand what really happens:

curcmd: (current command) an integer variable. It stores either the command code of the command being processed or the category code of the character being processed;
curchr: (current character) an integer variable, but what it stores depends on what TeX has just read from its input:

character: If the most-recently read-in item is a character, curchr stores the current character code.
command: If the most-recently read-in item is a command, curchr stores the command modifier: additional information TeX uses to support/clarify curcmd because, as we saw above, some commands share the same value of curcmd.

curcs: (current control sequence) an integer variable which stores the value calculated from the string of characters in a command name. curcs = 0 if the last item read was an individual character and not the name of a control sequence (a command name);
curtok: (current token) an integer variable that holds the value of the current token—which is either a command token or a character token.

Here is the above information displayed as a table:

Global variable used inside TeX:	When TeX scans a character:	When TeX scans a command:
curcmd	Stores the category code of the current character	Stores the command code—which identifies the “type” of the current command
curchr	Stores the character code of the current character	Stores supplementary data (called the command modifier) which provides additional information about the current command
curcs	0	A non-zero positive integer that is calculated (via a hash function) using the string of characters present in the command name. It is used to access TeX’s “dictionary” to look-up the current meaning of a command—to retrieve its command code and command modifier.
curtok	For 8-bit TeX engines, a character token is calculated using the formula: \[\text{curtok}=256\times \text{curcmd} + \text{curchr}\] where \(\text{curcmd}\) is the character’s category code and \(\text{curchr}\) is the character code	For 8-bit TeX engines, a command token is calculated using the formula: \[\text{curtok}=4095 + \text{curcs}\]
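
The table can also be read as a small piece of bookkeeping. The Python sketch below is a hypothetical model (the function names and the use of a dictionary are choices made for this illustration, and the formulas assume an 8-bit engine); it shows what the four variables would hold after a character and after a command. The value 1234 used for curcs is a made-up lookup integer.

```python
# Sketch of the four variables after TeX reads one item (8-bit engine).
# Function names and the dictionary layout are hypothetical.

CS_TOKEN_OFFSET = 4095   # the constant in the command-token formula above

def after_character(catcode, charcode):
    """State after scanning a character."""
    return {
        "curcmd": catcode,                   # category code
        "curchr": charcode,                  # character code
        "curcs": 0,                          # 0 means "not a command"
        "curtok": 256 * catcode + charcode,  # character token
    }

def after_command(curcs, command_code, modifier):
    """State after scanning a command and looking up its meaning."""
    return {
        "curcmd": command_code,              # "type" of command
        "curchr": modifier,                  # command modifier
        "curcs": curcs,                      # integer computed from the name
        "curtok": CS_TOKEN_OFFSET + curcs,   # command token
    }

print(after_character(11, ord("H")))
# 1234 is an invented value; 97 and 2 are the codes listed for \edef
print(after_command(curcs=1234, command_code=97, modifier=2))
```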

Further notes on the current token

For characters, the maximum possible token value is obtained using the largest category code (15) and the largest character code which, for 8-bit TeX engines, is 255. In theory (for 8-bit TeX engines), the maximum character token value, \(\text{curtok}_{\text{max}}\), is:

\[ \text{curtok}_{\text{max}}= 256\times 15 + 255 = 4095\]

We note “in theory” because category code 15 is used to represent an “invalid character” which causes TeX to generate an error: an invalid character will never get past TeX’s input scanning process and so it won’t ever become a character token.

For commands, the current token (\(\text{curtok}\)) is calculated from \(\text{curtok}=4095 + \text{curcs}\). Because \(\text{curcs}\) is always non-zero for commands, a command token is always greater than 4095, so TeX can easily determine what a token represents:

If \(\text{curtok} > 4095 \) then it is a command token;
If \(\text{curtok} \le 4095 \) it is a character token.
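
As a rough illustration of that test, the following Python sketch takes a token value and reports what it represents. It assumes the 8-bit formulas given in this article; the function name and the output wording are hypothetical.

```python
# Sketch: working out what a token represents from its integer value alone.
# Assumes the 8-bit formulas in this article; names are hypothetical.

CS_TOKEN_OFFSET = 4095

def describe_token(curtok):
    """Classify a token and unpack the information stored inside it."""
    if curtok > CS_TOKEN_OFFSET:
        return f"command token, curcs = {curtok - CS_TOKEN_OFFSET}"
    catcode, charcode = divmod(curtok, 256)
    return f"character token, category code {catcode}, character code {charcode}"

print(describe_token(2888))   # the letter H with category code 11
print(describe_token(5329))   # a command whose curcs is 1234
```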

In effect, TeX uses tokens, a simple integer value, to “package” all the information it needs to know about an item read from the input.

Part 4

In Part 4 we explore a range of example macros to demonstrate the role and purpose of a macro’s <parameter text> section, which acts as a “token template” that can be constructed through the use of delimiter tokens.