A New Series of Articles: TeX Tokens and Related Concepts—But Why (and How)?

By Graham Douglas

This introductory article explains why this series was written and what I hope to achieve, and provides background information on the techniques used to explore TeX tokens by observing the inner workings of a TeX engine. This page is also designed to hold the links to all articles in the series and will be updated with those links as each new article is published.

Links to the articles

Each article will contain a set of links to other articles in this series.

Background to this article series

The motivation to write this series of articles arose through reading material about TeX which explained many of TeX’s activities through the concept of “tokens” together with TeX’s “tokenization process”, “token lists” and related concepts such as “macro expansion” and “expandable commands”. Whenever I encountered TeX-related explanations phrased in terms of “TeX tokens” the same question kept coming to mind: What, precisely, is a TeX token? I needed to find out.

The scope and content of the first article, What is a TeX token?, is, by its very nature, quite “close to the metal”, as programmers might say, and there’s no doubt that “TeX tokens” could be classified as a pretty arcane topic to write about. So why bother? Ultimately, you take a view (or, perhaps, a leap of faith) that other people may also have been puzzling over the same topic and that there’s scope for an article or two to fill in some gaps. My aim is to provide some useful background explanations which can complement other material you might be reading and which, hopefully, will help you to better understand some key concepts that arise as you learn about TeX and explore macros and programming.

Clearly, within the confines of blog articles we can only skim the surface—it’s simply not practical to attempt an explanation of all salient topics or to dive into the murkiest waters. Of necessity, I will skip considerable detail and walk the fine line between over-simplification and pushing analogies to breaking point.

“Write articles you’d like to have read” is a useful guide and one I’ve tried hard to apply as I wrote this series.

Having asked the question, now what?

The immediate challenge was clear: how do you find out about TeX tokens when such details (of tokens, tokenization and so on) are buried deep within the source code of TeX engines? You’re not really supposed to worry about them unless, of course, you’re actually interested in those details.

One way to answer these questions is to read TeX’s original source code, either by running WEAVE on tex.web to extract the TeX documentation or by buying a copy of the book Computers & Typesetting, Volume B: TeX: The Program. I bought a copy of the printed book! It’s certainly extremely helpful to have TeX’s source code published in book form and there are, of course, many useful explanations throughout. However, Knuth’s TeX is written in Pascal and, naturally, the Pascal source code is documented using Knuth’s literate programming methodology, which presents the code in small, bite-sized chunks. It’s easy to appreciate how Knuth’s approach to documentation really helps for a piece of software as complex as TeX, but reading the book does entail quite a lot of cross-referencing and page-hopping.

Although helpful, the book alone wasn’t quite sufficient (for me) to gain a better understanding of what happens as TeX creates “tokens”—the topic I was particularly interested in. There’s only one way to truly find out: build the TeX program, execute it on a small TeX file and literally watch the code execute as TeX scans and reads the input. The details of building TeX from source are somewhat arcane—converting Pascal to C—but there’s a short description in the next section together with a link to a personal blog post that goes into more detail.

Unlike XeTeX and LuaTeX—which can process text in UTF-8 format and support Unicode text encoding—Knuth’s TeX is an 8-bit engine, meaning that it assumes input characters are in the range 0 to 255. Although this is an important distinction, it does not materially affect our discussion of TeX tokens because we cover topics and principles common to all TeX engines: they are at the very heart of the software.
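To make the 8-bit distinction concrete, here is a small Python sketch. It is not taken from any TeX engine; it simply assumes the input file is encoded as UTF-8 and shows how one accented letter looks to a reader that decodes Unicode versus one that works byte by byte.

```python
# Illustration only: the same text as seen by a Unicode-aware reader
# and by a reader that consumes one 8-bit byte at a time.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# A Unicode-aware engine decodes UTF-8 and sees a single character.
code_points = [ord(ch) for ch in text]

# An 8-bit engine sees the raw bytes, each in the range 0 to 255.
raw_bytes = list(text.encode("utf-8"))

print(code_points)  # [233]
print(raw_bytes)    # [195, 169]
```

In other words, one character for XeTeX or LuaTeX arrives as two separate input characters in Knuth’s TeX, but what the engine does with each input character afterwards follows the same principles.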

How can you study TeX tokens?

Unpicking the route from input text to TeX tokens has, for me, been quite a journey—I should confess that varying degrees of confusion were semi-permanent companions along the way: TeX is such a complex piece of software.

For some years (since around 2009) I have routinely compiled the latest version of LuaTeX from its source code, a process that is quite straightforward thanks to the absolutely superb way that LuaTeX’s source code is distributed. Based on that experience I became interested in understanding how to build Knuth’s original TeX from its source code, which is a very different proposition because TeX is written using Knuth’s literate programming methodology. That personal build of TeX, on Windows but using open-source compilers and toolsets, was undertaken outside of the TeX Live distribution and is a standalone project. It also required building the toolchain needed to convert tex.web into a C program which could be compiled and then executed in a debugger to see what TeX actually does as it processes the characters of input.

Knuth’s original TeX was used instead of pdfTeX, XeTeX or LuaTeX because I needed a version of TeX that was closest to the printed source code in the book TeX: The Program. That book was first published in 1986 and although TeX has undergone some updates since that time, the latest version of TeX (3.14159265, released in January 2014) is certainly sufficiently close to the source code contained in the book.

Reflecting Knuth’s literate programming methodology, TeX’s source code is distributed in a text format called WEB: a mixture of TeX documentation and Pascal source code. The basic idea is that you use two utilities called TANGLE and WEAVE which process WEB files to extract either the TeX documentation or the Pascal source code:

  • TANGLE extracts the Pascal source code from a WEB file
  • WEAVE extracts the TeX documentation from the WEB file
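As a rough illustration of that split, the Python sketch below separates a tiny, made-up WEB fragment into its documentation lines and its Pascal lines. It is a deliberately simplified toy and an assumption about how one might model the idea, not how the real utilities work: the real TANGLE also expands macros and named modules, and the real WEAVE typesets the Pascal code alongside the documentation. The function name split_web and the sample text are hypothetical.

```python
def split_web(source):
    """Split toy WEB text into (documentation_lines, pascal_lines)."""
    doc, code = [], []
    in_code = False
    for line in source.splitlines():
        stripped = line.rstrip()
        if line.startswith("@ ") or line.startswith("@*"):
            in_code = False            # a new module opens with its TeX part
            doc.append(line[2:].strip())
        elif line.startswith("@p"):
            in_code = True             # start of an unnamed Pascal part
            rest = line[2:].strip()
            if rest:
                code.append(rest)
        elif line.startswith("@<") and stripped.endswith("@>="):
            in_code = True             # start of a named Pascal part
            code.append(stripped)
        elif in_code:
            code.append(line)
        else:
            doc.append(line)
    return doc, code


# A made-up two-module WEB fragment (not taken from tex.web).
sample = """\
@* Introduction.
This toy program prints a greeting.

@ The main program is a single statement.
@p program hello;
begin writeln('Hello, TeX');
end.
"""

doc, code = split_web(sample)
print(code)  # ['program hello;', "begin writeln('Hello, TeX');", 'end.']
```

The point of the toy is only that a single WEB file carries both views of the program, and that each utility reads the same file with a different goal.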

However, before you extract the Pascal source code you have to pre-process Knuth’s tex.web file to apply a number of changes that enable conversion of TeX’s Pascal code into C code using a process called Web2C. This pre-processing step is referred to as applying change files.

Knuth’s original code file (tex.web) must not be directly altered in any way; instead, you apply modifications using so-called change files (extension .ch) which contain the changes you wish to apply to the main .web file—such as tex.web. Change files are merged with Knuth’s original source code—using an additional utility program called TIE—to create a file called, say, mytex.web which you process with TANGLE to extract the Pascal code into mytex.pas. Once you have an appropriate Pascal source file you can apply the final steps in the Web2C process to convert it into a C source code file that you can compile into an executable TeX program. If you want to read about the rather convoluted Web2C conversion process there are further details on my personal blog site.
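The idea behind change files can be sketched in a few lines of Python. A change file consists of blocks in which the lines between @x and @y must match lines in the .web file, and the lines between @y and @z replace them. The toy merger below is only an assumption-laden stand-in for that matching step: the real tools are more careful (for example about whitespace and error reporting), and the function names and sample text here are hypothetical.

```python
def parse_change_file(ch_text):
    """Return a list of (old_lines, new_lines) pairs from @x/@y/@z blocks."""
    changes, old, new, state = [], [], [], None
    for line in ch_text.splitlines():
        tag = line[:2].lower()
        if tag == "@x":
            old, new, state = [], [], "old"
        elif tag == "@y" and state == "old":
            state = "new"
        elif tag == "@z" and state == "new":
            changes.append((old, new))
            state = None
        elif state == "old":
            old.append(line)
        elif state == "new":
            new.append(line)
        # Lines outside any @x...@z block are treated as comments.
    return changes


def apply_changes(web_text, ch_text):
    """Apply changes in order; each must match after the previous match."""
    lines = web_text.splitlines()
    out, pos = [], 0
    for old, new in parse_change_file(ch_text):
        for i in range(pos, len(lines) - len(old) + 1):
            if lines[i:i + len(old)] == old:
                out.extend(lines[pos:i])   # copy untouched lines
                out.extend(new)            # substitute the replacement
                pos = i + len(old)
                break
        else:
            raise ValueError("change did not match: " + repr(old[:1]))
    out.extend(lines[pos:])
    return "\n".join(out) + "\n"


# Made-up inputs: a tiny "original" file and a one-block change file.
web = """\
@ A toy module.
@d banner=='This is ToyTeX'
@p begin print(banner);
end.
"""
ch = """\
Comment lines outside the blocks are ignored.
@x
@d banner=='This is ToyTeX'
@y
@d banner=='This is MyToyTeX'
@z
"""

merged = apply_changes(web, ch)
print(merged)
```

The original text is never edited: the merged result is a new file, which is why the same tex.web can serve many different builds.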

The end result is a TeX program which can be executed using the free and excellent Eclipse IDE for C/C++ to single-step through TeX’s source code (in C) and watch what happens as it scans through your input. It’s definitely not the most entertaining of pastimes because the C code is machine-generated and, in places, extremely difficult to follow (TeX’s source code makes very generous use of GOTOs and global variables). The book TeX: The Program is still invaluable to help navigate the C source, even though the book contains TeX’s source code in beautifully typeset Pascal code.

Just to round off the discussion, here is an example screenshot showing TeX being executed via the Eclipse IDE with execution paused on the function getnext(), which is at the heart of TeX’s token-generation processes.

Stepping through TeX’s C source code using the open source Eclipse IDE for C/C++.

Conclusions and a Thank You

Writing the first article and putting together ideas for future blog posts in the series has certainly been quite time-consuming. I’m extremely grateful to John Hammersley and Mary Anne Baynes at Overleaf for their support of this series idea and for allowing me to take the time required for additional background research. It is my hope that this series of articles successfully identifies and addresses topics of common concern and proves to be of value to those who read them.

Graham Douglas

Content Development Editor

I've worked in scientific/technical publishing for over 20 years (Senior Publisher, book/journal production, and programming). Now relishing the opportunity to combine my interests in publishing and TeXnology. I work from home, ably assisted by our two delightfully inquisitive Bengal cats: Oscar and Alfie.