Pandora’s \hbox: Using LuaTeX to Lift the Lid of TeX Boxes

Introduction

Boxes and glue are two key concepts which provide the foundation for TeX’s typesetting model and capabilities. Building on the introductory material in a previous post, Boxes and Glue: A Brief, but Visual, Introduction Using LuaTeX, this extensively-illustrated article examines boxes and glue in more detail. We also present a new LuaTeX-based Overleaf project that enables you to explore the deep inner structure of TeX boxes—providing insights which will help you to truly understand their behaviour. Creation of the Overleaf project was greatly facilitated by the work of Patrick Gundlach, so we offer our thanks to him.

Why choose LuaTeX?

Firstly, it is worth re-stating the difference between LuaTeX and LuaLaTeX:

LuaTeX is the name of an executable TeX-based typesetting engine;
LuaLaTeX refers to the use of the LaTeX macro package with the LuaTeX engine.

This distinction is extremely important because, in this article, we are exploiting the built-in capabilities of the LuaTeX engine itself, and not just leveraging the features/functionality of commands provided by the LaTeX macro package.

Readers who are uncertain of the difference between a TeX engine and the LaTeX macro package may want to read one of our previously-published articles, What's in a Name: A Guide to the Many Flavours of TeX, which explains those differences in some detail. That same article also discusses “TeX” as a programming language and explains that TeX-based typesetting engines (e.g., pdfTeX, XeTeX and LuaTeX) vary not only in their features and functionality but also in the “flavour” of the TeX language they support. This brings us to our choice of LuaTeX. In addition to supporting a TeX-based programming language, LuaTeX also has the Lua scripting language embedded into it, providing access to a simple, but very powerful, conventional programming language. Through Lua, and LuaTeX’s built-in functionality, you can explore and control the typesetting activities of LuaTeX in ways that no other TeX engine provides, and this includes the ability to probe the inner structures of TeX boxes. Hence LuaTeX is the ideal (indeed, the only) choice for this article and the accompanying Overleaf project.

pdfTeX/XeTeX vs LuaTeX: in pictures

The following schematics are intended to highlight an important comparison between the design of pdfTeX/XeTeX and LuaTeX. Both pdfTeX and XeTeX do, of course, allow users to write TeX code that can influence typesetting behaviour; however, the deeper internal structures contained within those TeX engines, and low-level data constructed during the typesetting process, are mostly inaccessible to user commands and macros. In that sense, they are relatively closed systems when compared to LuaTeX.

pdfTeX/XeTeX

LuaTeX

LuaTeX introduces a new primitive command called \directlua{...} through which you can write code that not only gives full access to the Lua language but also allows you to extend LuaTeX’s capabilities by writing plug-ins using languages such as C and C++. On Windows, such plug-ins are called Dynamic Link Libraries (.DLL); on Linux they are known as Shared Object Libraries (.so). However, LuaTeX’s real power is derived from a huge set of built-in Lua functions that provide access to the internals of LuaTeX, enabling extremely sophisticated control and programming of TeX-based typesetting. A set of such functions is known as an API (Application Programming Interface) and it is through LuaTeX’s API that you use Lua programs to communicate with its TeX-based typesetting engine and data structures.

With LuaTeX’s \directlua{...} command you can, for example, access low-level internal TeX data structures hidden from view within other TeX engines. In addition, you can use Lua scripts to perform all sorts of programming calculations, string manipulation etc. and pass the results back to TeX: the possibilities are almost endless. However, this article is not intended to be a detailed exposition or tutorial on LuaTeX—though it is tempting to give examples which convey the incredible versatility of this astonishingly powerful TeX engine.
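As a minimal sketch of passing a Lua result back to TeX, the following document (intended to be compiled with LuaLaTeX) asks Lua to reverse a string and to pick the largest of three numbers, then typesets both results via tex.print. The string and the numbers are illustrative choices, not taken from the Overleaf project.

```latex
% Compile with lualatex. The values used here are purely illustrative.
\documentclass{article}
\begin{document}
% string.reverse is a standard Lua function; tex.print hands its
% result back to TeX, where it is typeset like ordinary input.
Reversed by Lua: \directlua{tex.print(string.reverse("XeTauL"))}\par
% Lua also does arithmetic; math.max returns its largest argument.
% tostring converts the number to a string before it is printed.
The largest of 3, 7 and 5 is
\directlua{tex.print(tostring(math.max(3, 7, 5)))}.
\end{document}
```

Note that the Lua code avoids characters such as % and \textbackslash, because the argument of \directlua is read by TeX before Lua ever sees it.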

Boxes and glue: A brief reminder

As introduced in the article Boxes and Glue: A Brief, but Visual, Introduction Using LuaTeX, boxes and glue are two key concepts that underpin TeX’s typesetting capabilities. The following diagram is offered as a very brief aide-mémoire on the behaviour of TeX’s horizontal and vertical box types. Note: horizontal boxes can, of course, contain text typeset in right-to-left languages, such as Arabic or Hebrew, which means the direction of box growth can be opposite to that shown for the horizontal box in the diagram below.

TeX primitives for box construction

Today, most people prepare their TeX documents using the LaTeX macro package which is designed to provide commands that insulate users from much of TeX’s low-level language—its so-called primitives—the core commands built into TeX engines (see the article What's in a Name: A Guide to the Many Flavours of TeX for a discussion of TeX primitives). The LaTeX macro collection provides a variety of macros for box creation and storage (saving) but if you strip away all the macro code you’ll find there are just 4 low-level primitive box-construction commands:

For creating horizontal lists:

\hbox{...}

For creating and stacking vertical lists:

\vbox{...}
\vtop{...}
\vcenter{...}

We won’t be explaining how to use all these box commands because there are plenty of examples and tutorials elsewhere on the web or in TeX/LaTeX books—but we will be taking a look into how boxes are represented and stored inside of TeX data structures.

Glue: flexible spacing

Glue is, in effect, a form of spacing used by TeX to space/position items horizontally or vertically. As TeX users, we can instruct TeX to insert glue of a fixed size or we can use glue that is flexible, with as much freedom to stretch or shrink as our requirements demand. One of TeX’s commands to create glue for horizontal spacing is called \hskip, which takes the form

\hskip <natural width> plus <amount to stretch> minus <amount to shrink>

plus and minus are TeX keywords but you don’t need to use them for every glue. If either plus or minus is absent then the corresponding <amount to stretch> or <amount to shrink> is assumed to be zero. For example, \hskip 3pt inserts a fixed-width glue with no stretch or shrink component.

For now, think of <amount to stretch> and <amount to shrink> as our recommendations to TeX because the exact amount of stretching or shrinking will be calculated by TeX.

To help with these ideas, here is a diagram which represents glue as a spring. The <natural width> is the length of the spring when there is no tension (stretching) or compression (shrinking). The <amount to stretch> and <amount to shrink> are shown relative to the natural length of the spring.
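To make the difference between fixed and flexible glue concrete, here is a small sketch that sets the same two letters twice. The sizes are illustrative; the second box uses the keyword spread to ask for a box 1pt wider than its natural width, which forces the flexible glue to stretch.

```latex
% Compiles with lualatex (or any LaTeX engine); sizes are illustrative.
\documentclass{article}
\begin{document}
% Fixed glue: exactly 3pt of space, with no stretch or shrink.
\noindent\hbox{A\hskip 3pt B}\par
% Flexible glue: natural width 3pt, stretch 2pt, shrink 1pt.
% "spread 1pt" makes the box 1pt wider than its natural width,
% so TeX has to stretch this glue by 1pt to fill the box.
\noindent\hbox spread 1pt{A\hskip 3pt plus 2pt minus 1pt B}
\end{document}
```

In the second box the glue is the only flexible item, so it ends up 4pt wide: its 3pt natural width plus the 1pt of extra space requested.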

An `\hbox` example

Suppose we want to create an \hbox{...} containing just the letters A, B, C and D and we need this box to be 100pt (100 TeX points) wide. It is safe to assume that the total width of those four characters is far less than 100pt, so TeX needs some way to fill up the remaining space within the box: we’ll use some glue to do that. However, because we do not know the exact amount of glue required to fill the box, it is advisable to add some flexible glues and let TeX take care of calculating the amount of space those glues need to occupy. In the following code snippet, note the use of “%” to suppress interword spaces arising from the end-of-line characters.

\hbox to100pt{%
A\hskip4pt plus3pt minus 2pt B%
\hskip 0pt plus 2fil C%
\hskip 0pt plus 2fill D%
\hskip 0pt plus 3fill}

The resulting box looks like this (enlarged for clarity):

This \hbox is overlaid with dashed boxes (in red) to indicate the width of the characters (as TeX sees them). For typesetting purposes, characters are considered to be small boxes and the amount of glue required to fill this \hbox is determined (calculated) by taking into account the widths of each character.

It turns out that TeX did not stretch or shrink the glue between A and B, which stays at its natural width of 4pt, and the glue between B and C contributes no space because it stays at its natural width of 0pt. However, the glue between C and D and the glue between D and the end of the box have both stretched considerably. Those two glues have fill stretch, which is a higher order of infinity than the fil stretch of the glue between B and C and the finite stretch of the glue between A and B, so in effect they absorbed all the stretching required to fill the box.
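The way the stretch is shared out can be worked through by hand. The sketch below assumes the default Computer Modern metrics at 10pt, in which the four letters have a combined width of roughly 29.44pt; with another font the numbers change, but the 2:3 split between the two fill glues does not.

```latex
\begin{align*}
\text{natural width} &\approx 29.44\,\text{pt} + 4\,\text{pt} = 33.44\,\text{pt}\\
\text{stretch needed} &= 100\,\text{pt} - 33.44\,\text{pt} = 66.56\,\text{pt}\\
\text{stretch per unit of fill} &= \frac{66.56\,\text{pt}}{2 + 3} \approx 13.31\,\text{pt}\\
\text{glue between C and D} &\approx 2 \times 13.31\,\text{pt} = 26.62\,\text{pt}\\
\text{glue after D} &\approx 3 \times 13.31\,\text{pt} = 39.93\,\text{pt}
\end{align*}
```

Only the glues with the highest order of stretch present in the box take part in this calculation, which is why the 3pt of finite stretch and the 2fil glue receive nothing.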

Back to LuaTeX

So far we’ve explored boxes and glue and seen that LuaTeX allows access to internal TeX structures hidden from view with pdfTeX and XeTeX. It’s time for an example to make this more explicit but, firstly, we need to briefly acquaint ourselves with the way that TeX stores boxes in its memory—we’ll start with an analogy.

How TeX stores boxes in memory: an analogy

Suppose, for some reason, you needed to create a data model which describes a physical box. What data might you choose to provide such a description? One approach you could adopt is to split the information into two parts: data about the physical box itself and data which provides a list of the box contents. So, our simple model might look like this:

Data about the physical box (“metadata”):

width
height
depth
weight
colour
type (wooden, plastic, cardboard)

Data about box content: some form of list which describes the items that it contains—probably listed in no particular order.

And there is a very close analogy with the way TeX stores boxes.

How TeX stores boxes in memory: hlists and vlists

Internally, TeX creates “containers” called hlists (horizontal lists) and vlists (vertical lists) which represent hboxes and vboxes respectively. These hlist/vlist objects provide a collection of “metadata” about the box, plus they provide access to the list of objects that the box actually contains; that list is called a node list. Unlike a physical box, where you can place objects inside it in any order, for TeX the order of box contents is extremely important because they are items to be typeset. If you have any programming or computer science background you won’t be surprised to learn that the objects within a TeX box are stored, and have their order of creation preserved, using a so-called doubly-linked list. We won’t discuss linked lists in any further detail because the web abounds with tutorials, examples and explanations.

The concept of nodes and node lists is a fundamental aspect of how TeX works but for the purposes of this article we’ll give just a brief outline. Nodes are, in essence, a sort of “mini container” and (as of LuaTeX 1.04) there are some 50 different types of node: reflecting the inner data types and components that LuaTeX uses for typesetting. For example, there are nodes to represent: glyphs (arising from “characters”), glue, horizontal/vertical rules, penalties, “whatsits”, kerns and so forth. All typeset material will, eventually, become part of a huge node list and LuaTeX gives you direct access to those inner data structures. LuaTeX also lets you add, edit, amend or create node lists so that, for example, you can create boxes directly inside Lua code without having to use any TeX code at all. However, writing about that is for another day.
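To give a flavour of what a node list looks like from Lua, here is a minimal sketch that walks the top-level node list of a small \hbox and writes the type of each node to the log file and terminal. It is an illustration only, not code from the Overleaf project; it relies on the LuaTeX library functions node.traverse and node.type.

```latex
% Compile with lualatex; the output appears in the log and terminal,
% so the document itself produces no pages.
\documentclass{article}
\begin{document}
\setbox0=\hbox{A\hskip 5pt B}
\directlua{
  % tex.box[0] is the hlist node for box register 0;
  % its head field points at the first node inside the box.
  local box = tex.box[0]
  % node.traverse visits each node of the list in order.
  for n in node.traverse(box.head) do
    % node.type converts the numeric node id into a readable name.
    texio.write_nl("node: " .. node.type(n.id))
  end
}
\end{document}
```

For this box one would expect to see glyph and glue entries, although the exact list can vary with the fonts and packages that are loaded. The comments inside \directlua are TeX comments (starting with %), which TeX removes before the code reaches Lua.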

A simple example of `\directlua{...}` in action

The following example creates an \hbox and saves it in box register 0. We then report the box’s width using traditional TeX code and obtain the same information using a second method via \directlua{}. Here, we run a small Lua script which accesses TeX’s internal box storage area to obtain the box’s width—of course, the two values are identical: 2412092sp (sp=scaled point: 65536sp = 1 TeX point). Ultimately, in this extremely simple example, the TeX code and Lua code both examine the same internal data structures to obtain the box’s width, but it is through the direct access route that LuaTeX opens the door to a wealth of information and control that is not available with other engines.

\documentclass{article}
\begin{document}
\setbox0=\hbox{A\hskip 5pt B\hskip 10pt C}
\fontsize{18}{22}\selectfont
\noindent Using \TeX{} code, box 0 has width \number\wd0\relax \space sp\par
\noindent We can also use Lua and call one of Lua\TeX's functions to get the same
information.\vskip10mm
\noindent From Lua code, box 0 has width 
\directlua{
local boxwidth = tex.box[0].width
tex.print(boxwidth.." sp")
} which, of course, is identical to the value obtained from \TeX{} code.
\end{document}

Putting it all together: An Overleaf project

We’ve noted that, internally, TeX represents boxes as “containers” called hlists/vlists which store “metadata” about the box and provide access to the list of components from which the box is constructed. Using LuaTeX you can access the box “metadata” and the list of items contained in a TeX box: glyphs, glue, penalties, other boxes, and so forth. Using Lua scripts, it is possible to examine a box sitting in TeX’s memory and draw a detailed representation of what that box contains. A suitable representation of a TeX box and its content is achieved using node graphs and we have prepared an Overleaf project which does that by leveraging an excellent Lua script written by Patrick Gundlach (see credits). We won’t describe the detailed processes required to examine boxes and generate node graphs—except to note that any program/script which processes TeX boxes has to be recursive because boxes can be nested: i.e., you can have hboxes within vboxes, within hboxes… combining all box types to a very deep level of nesting.
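The need for recursion can be illustrated with a short sketch. The Lua function below, here given the hypothetical name walk, prints the type of every node in a box and calls itself whenever it meets a nested hlist or vlist. It is a simplified illustration and not the script used by the Overleaf project.

```latex
% Compile with lualatex; results are written to the log and terminal.
\documentclass{article}
\begin{document}
% A vbox containing two hboxes, one of which holds a further hbox.
\setbox0=\vbox{\hbox{A\hbox{B}}\hbox{C}}
\directlua{
  local hlist_id = node.id("hlist")
  local vlist_id = node.id("vlist")
  % walk is a hypothetical helper name chosen for this sketch.
  local function walk(head, depth)
    for n in node.traverse(head) do
      % Indent each entry according to its nesting depth.
      texio.write_nl(string.rep(". ", depth) .. node.type(n.id))
      % Boxes can contain boxes, so descend into nested lists.
      if n.id == hlist_id or n.id == vlist_id then
        walk(n.head, depth + 1)
      end
    end
  end
  walk(tex.box[0].head, 0)
}
\end{document}
```

Because each nested box is handled by the same function, the depth of nesting does not need to be known in advance.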

What does the project provide?

It implements just one command called \dobox{box command}, for example:

\dobox{\hbox to100pt{%
A\hskip4pt plus3pt minus 2pt
B\hskip 0pt plus 2fil
C\hskip 0pt plus 2fill
D\hskip 0pt plus 3fill}}

The \dobox{...} command performs a number of tasks:

within your document it typesets the verbatim TeX code for your box;
it generates an SVG graphic of the TeX box—you can embed this in a web page (as we have done within this blog post);
it generates an SVG graphic of the node list—which you can also embed into web pages (as we have done within this blog post);
it outputs a PDF graphic of the node list which is then imported into the main PDF document produced by the project.

Node graphs can very quickly become extremely large due to the enormous amount of data that LuaTeX needs to store in order to represent complex TeX boxes, such as the page currently being constructed, or typeset mathematics. For larger node lists, the imported PDF graphic may be clipped by your document’s page boundary. If you want to view a large node graph you can download a ZIP file of the project and extract the PDF graphic of interest. When you download the project's ZIP file make sure to choose “Input and Output Files” from the drop-down option list:

Graphics from the Overleaf project: A brief description

Before we show some examples, it is worth making a few observations on the graphics produced by the Overleaf project—we’ll use the same \hbox example mentioned earlier in the article. Here it is wrapped up in the project’s \dobox{...} command:

\dobox{\hbox to100pt{%
A\hskip4pt plus3pt minus 2pt
B\hskip 0pt plus 2fil
C\hskip 0pt plus 2fill
D\hskip 0pt plus 3fill}}

Here is the \hbox produced by TeX. For clarity, the box has been scaled up; the border shown is included in graphics produced by the Overleaf project.

Here is an annotated SVG diagram of the node list representing the above box—annotations were added to highlight the box “metadata” and the list of objects it contains: those annotations are not present in the graphics produced by the Overleaf project.

If you look at the “metadata” section you might observe some unfamiliar parameters:

glue_set
glue_sign
glue_order

These parameters are the settings used by TeX to calculate how much the glue has to stretch or shrink within this box and are just one example of data that you can easily obtain via LuaTeX but not with other TeX engines. Note that glue nodes contained within the box components retain the original glue values we typed in to create the box. This is essential because TeX provides the commands \unhbox, \unvbox, \unhcopy and \unvcopy, which “unbox” the box’s contents and append them to the current list, where they can once again take part in typesetting operations. It is only when TeX finally outputs (ships out) the box to a PDF or DVI file that glue_set, glue_sign and glue_order are applied to any glues contained in the box: to calculate the actual amount of stretching or shrinking required to position components within the box and then to generate appropriate PDF data or DVI opcodes.
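These values can also be read directly from Lua, without drawing a node graph. The following sketch rebuilds the example box in register 0, prints its three glue parameters and then lists the glue nodes inside it. It assumes LuaTeX 1.0 or later, where glue nodes expose width, stretch and stretch_order as direct fields.

```latex
% Compile with lualatex; results are written to the log and terminal.
\documentclass{article}
\begin{document}
\setbox0=\hbox to100pt{%
A\hskip4pt plus3pt minus 2pt B%
\hskip 0pt plus 2fil C%
\hskip 0pt plus 2fill D%
\hskip 0pt plus 3fill}
\directlua{
  local box = tex.box[0]
  % The box-level settings TeX computed when it built the box.
  texio.write_nl("glue_set = " .. box.glue_set)
  texio.write_nl("glue_sign = " .. box.glue_sign)
  texio.write_nl("glue_order = " .. box.glue_order)
  % The glue nodes still hold the values typed in the source.
  for n in node.traverse_id(node.id("glue"), box.head) do
    texio.write_nl("glue: width = " .. n.width .. "sp")
    texio.write_nl("  stretch = " .. n.stretch)
    texio.write_nl("  stretch_order = " .. n.stretch_order)
  end
}
\end{document}
```

According to the LuaTeX manual, a glue_sign of 1 means the glue is stretching and 2 means it is shrinking, while glue_order records which order of infinity is being used.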

Another parameter listed in the “metadata” is shift: this is the value of box displacement resulting from applying TeX commands:

\raise, \lower (applied to an \hbox);
\moveleft, \moveright (applied to a \vbox).

In our example, shift is 0pt because we did not displace the \hbox from its natural position.
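For comparison, here is a small sketch of a box that is displaced. It places a raised \hbox inside box register 0 and prints the shift field of that nested hlist; the 2pt displacement is an arbitrary illustrative value.

```latex
% Compile with lualatex; results are written to the log and terminal.
\documentclass{article}
\begin{document}
% Box 0 holds the letter A followed by a nested \hbox raised by 2pt.
\setbox0=\hbox{A\raise 2pt\hbox{B}}
\directlua{
  % Visit only the hlist nodes inside box 0 (here, the raised box).
  for n in node.traverse_id(node.id("hlist"), tex.box[0].head) do
    texio.write_nl("nested hlist shift = " .. n.shift .. "sp")
  end
}
\end{document}
```

TeX records a raised box with a negative shift and a lowered box with a positive one, so a negative value would be expected here.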

The Overleaf project also outputs node graph diagrams in PDF format: here is a link to download a PDF file version of the node graph above.

How does the Overleaf project create those graphics?

The Overleaf project leverages the ability to run software tools and utilities installed on Overleaf’s servers—see this blog post for more details and a sample project. To produce an SVG graphic representing a TeX box, the box’s TeX code is written out to a small file which is then typeset with pdfTeX to generate a DVI file—note that the pdfTeX program is executed by LuaTeX through the use of a few lines of Lua script. That DVI file is converted, on-the-fly, to SVG using the dvisvgm utility—which is shipped with the TeX Live distribution installed on Overleaf’s servers. dvisvgm is executed with command line option -n to ensure that any typeset text is converted to lines/curves so that correct rendering of the SVG file does not depend on TeX fonts being installed.

To create the node graphs we use a Lua script called hiviznodelist.lua which is based on work by Patrick Gundlach. That script writes out a so-called .gv (Graphviz) file which is a text file containing a node graph described in the dot language. The .gv file is processed by a utility program called dot which outputs a node diagram in both PDF and SVG file formats.
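The chain of tools described above could be driven from Lua along the following lines. This is only a sketch and not the project's actual code: the file names mybox.tex and nodes.gv are hypothetical, and os.execute only runs external programs when shell escape is enabled.

```latex
% A sketch only: compile with lualatex --shell-escape.
% The files mybox.tex and nodes.gv are hypothetical and must exist.
\documentclass{article}
\begin{document}
\directlua{
  % Typeset the box's TeX code with pdfTeX, producing a DVI file.
  os.execute("pdftex -output-format=dvi mybox.tex")
  % Convert the DVI file to SVG; -n turns text into lines and curves.
  os.execute("dvisvgm -n -o mybox.svg mybox.dvi")
  % Render the Graphviz description of the node list as SVG and PDF.
  os.execute("dot -Tsvg nodes.gv -o nodes.svg")
  os.execute("dot -Tpdf nodes.gv -o nodes.pdf")
}
Conversion commands issued.
\end{document}
```

Running the external tools from inside the LuaTeX run keeps everything within a single compilation, which suits an environment such as Overleaf where the user does not have a separate command line.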

Project examples

Here are some additional examples with SVG graphics produced using the Overleaf project. Boxes containing a lot of text (e.g., in a \vbox), or complex mathematics, will produce enormous node graphs—if you explore the Overleaf project, it is advisable not to use unnecessarily complex boxes to demonstrate the features of interest to you.

`\vbox to 25pt{A}`

This example demonstrates the effect of putting text directly into a \vbox: note that the node structure is quite complex, even for such a simple box. The reason for this complexity is that text placed directly into a \vbox causes TeX to undertake linebreaking. You can see that the \vbox is 345pt wide: the value of \hsize at the time this box was created. Also note that the character “A” is contained within an hlist that is also 345 points wide, and observe the large penalty (10000) together with \parfillskip and \rightskip glues at the end of the box contents. That penalty and the two glue items are inserted by TeX’s linebreaking activities. If you look at the glue_set value for the paragraph line (hlist) containing the letter “A” you will see it is extremely large (322.500000): why is that? It is because the paragraph line is 345pt wide but contains only a \parindent and the letter “A”: the remaining space has to be filled by the \parfillskip glue which has to stretch a considerable distance to fill the remaining space on the line.
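The glue_set value can be checked with a little arithmetic. The sketch below assumes the article class defaults at 10pt, where \parindent is 15pt and \rightskip is 0pt, and takes the width of “A” to be 7.50002pt.

```latex
\begin{align*}
\text{space to fill} &= \text{\textbackslash hsize} - \text{\textbackslash parindent} - \text{width of A}\\
&= 345\,\text{pt} - 15\,\text{pt} - 7.50002\,\text{pt}\\
&= 322.49998\,\text{pt} \approx 322.5\,\text{pt}
\end{align*}
```

In the standard formats \parfillskip is 0pt plus 1fil, so a single unit of fil has to supply all of this space, which is why the glue_set is approximately 322.5.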

Download PDF file

`\vbox to 25pt{\hbox{A}}`

It is very instructive to compare this example to the previous one. Here, not only is the node graph considerably smaller, but the width of the \vbox is just 7.50002pt: the same width as the character “A”. The reason is that the “A” has been wrapped in an \hbox which prevents the \vbox triggering TeX to perform linebreaking—an important characteristic of boxes created with \vbox.
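The difference in width between the two boxes can also be confirmed from Lua. This sketch stores both boxes in registers (0 and 2, chosen arbitrarily) and prints their widths in points.

```latex
% Compile with lualatex; results are written to the log and terminal.
% TeX will also warn about underfull boxes, which is expected here.
\documentclass{article}
\begin{document}
% Text placed directly in a \vbox: TeX builds a paragraph.
\setbox0=\vbox to 25pt{A}
% Text wrapped in an \hbox: no paragraph is built.
\setbox2=\vbox to 25pt{\hbox{A}}
\directlua{
  % Widths are stored in scaled points; 65536sp = 1pt.
  texio.write_nl("box 0 width: " .. tex.box[0].width / 65536 .. "pt")
  texio.write_nl("box 2 width: " .. tex.box[2].width / 65536 .. "pt")
}
\end{document}
```

The first width should match the value of \hsize in force when the box was built, and the second should match the width of the letter “A” in the current font.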

Download PDF file

Simple maths: `\hbox{$\displaystyle \int f(x) dx$}`, complex box!

This example demonstrates that even very simple typeset mathematics creates a detailed box structure: typesetting mathematics produces extremely complex data structures within TeX!

Download PDF file

Credits: thanks Patrick!

Our thanks to Patrick Gundlach who has granted Overleaf permission to use and distribute a modified version of his Lua script, viznodelist.lua, which processes TeX boxes and outputs a file (in the dot language) that can be processed to draw a node graph. The Overleaf project contains a Lua script called hiviznodelist.lua, a renamed and modified version of Patrick’s original code, which is available on GitHub. Patrick has created an open-source LuaTeX-based typesetting system called speedata Publisher which you can download and use for free; commercial support options are also available.
