A Data-Driven Approach to LaTeX Autocomplete

    Posted by Nate on August 24, 2017

    Autocomplete

    Nearly anywhere you go on the web today you will find some sort of autocomplete feature. Start typing into Google and you get immediate suggestions related to your query. If you code in other languages, many IDEs have built-in or configurable autocomplete tools that complete variables, functions, methods, etc., with varying degrees of success. At their best, these tools speed up the process of programming by actively checking for bugs, suggesting variables of the correct type and methods of the correct object/class, and sometimes offering documentation when opening functions. These tools allow the user to focus their time on more valuable concepts and ideas, rather than syntax.

    When learning a new programming language, especially if it is your first language, it can be difficult to remember syntax, and \(\mathrm{\LaTeX}\) is no exception. “Was it \product, \times, \mult, \prod or something else to produce \(\prod\)?” These questions are often asked by new users and can range from a rather minor nuisance to a painstakingly slow and annoying time sink. By the way, it is \prod☺.

    To help combat the aforementioned issues (and others), Overleaf has included a default list of commands which it will suggest. Just type \ into the editor to see the dropdown list. The list is by no means comprehensive, but it does offer most of the frequently used commands that are needed to build a basic \(\mathrm{\LaTeX}\) document.

    What can we do to improve?

    When suggesting commands, as of now, we simply run a fuzzy search using Fuse.js, which works well in some cases but does not account for the popularity of commands. For example, when typing \c, the fuzzy search ranks commands beginning with c first, and hence columnbreak is the first completion. While, according to the algorithm, this is a good match, it isn't the best for productivity. Wouldn't it be nice if chapter, cite, caption, and centering were suggested before that?
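
    To make the idea concrete, here is a minimal Python sketch of popularity-aware ranking. It is not how Fuse.js or the Overleaf editor works: plain prefix matching stands in for the fuzzy search, the function name rank_suggestions is hypothetical, and the frequency values are invented purely for illustration.

```python
def rank_suggestions(prefix, commands, frequency):
    """Return the commands starting with `prefix`, most frequent first."""
    matches = [c for c in commands if c.startswith(prefix)]
    # Sort by descending frequency; ties are broken alphabetically.
    return sorted(matches, key=lambda c: (-frequency.get(c, 0.0), c))


# Hypothetical frequencies, invented for illustration only (not measured data).
frequency = {
    "cite": 0.03,
    "caption": 0.02,
    "centering": 0.015,
    "chapter": 0.01,
    "columnbreak": 0.0001,
}
commands = ["columnbreak", "chapter", "cite", "caption", "centering", "section"]

suggestions = rank_suggestions("c", commands, frequency)
print(suggestions)
```

    With weights like these, columnbreak still matches the prefix but drops to the bottom of the list instead of sitting at the top.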

    To make this happen we have begun studying which commands are being used frequently in publicly available \(\mathrm{\LaTeX}\) documents. Fortunately, there are many collections of public \(\mathrm{\LaTeX}\) documents that we can use, such as the arXiv, and also the Overleaf Gallery, which contains just under 8000 .tex documents (we define a document as a single .tex file). For this blog post, we’re going to use the Overleaf gallery, which contains a mixture of research articles, presentations, and CVs. It also contains a large number of \(\mathrm{\LaTeX}\) examples and \(\mathrm{\LaTeX}\) templates, which are not necessarily the most representative documents, but it provides a good starting point, and as we will see, a useful one.

    The reason we want chapter, cite, caption, and centering to be ranked ahead of columnbreak is that they are used more often. So to find the commands that should top the suggestion list, one might think to simply look at raw counts of commands. Doing this for our corpus, we find some odd commands in the top ten list that we didn't quite expect (pgf and pdfglyphtounicode). After a little investigation we found these commands were appearing tens of thousands of times in very few documents and nowhere else. To avoid such extreme cases we weight commands by the number of documents they appear in (see Methodology for details).

    It is not too surprising that we find many of \(\mathrm{\LaTeX}\)’s structural commands in the top ten list, but perhaps textbf is somewhat surprising. I guess bolding is more fashionable than italicizing. Another feature which might spark some interest is the relatively high frequency of chapter when we exclude the documents in which it never appears. This is because of the context in which this command appears. Often, when writing documents with multiple chapters, authors will break these chapters into separate files and have main.tex call the respective chapters in via \chapter{foo}\input{chapters/foo}. This somewhat artificially inflates the frequency of the chapter command (and possibly other commands). We say artificially because the input files are really all part of the same project, and they should be considered together. This, however, has not yet been done in our analysis.

    We can produce an analogous bar plot for environments (anything that starts with \begin{…} and ends with \end{…}) where we view an entire environment as a single entity. The following shows mostly what we would expect, with document appearing the most often among all documents, but rather peculiarly, the frequency of the frame environment (from the beamer package) is very large once we only count documents in which it appears at least once. This means that while it's not the most frequently occurring environment, when it is used, it constitutes nearly 40% of the document's environments!

    With this data we will rank commands based on their corpus frequency so you spend less time looking for your command, and more time focusing on what’s important. Now you may say:

    Hold on, so even once I start my document environment, the next time I open an environment I will be suggested \begin{document} as the number one completion?

    Well, it takes some fine tuning. In particular, one thing we can do is look at the median number of times these commands are used in a document (excluding documents in which they do not appear). Doing this gives us a better picture of how many times commands are being used in a given document. So if a command is usually being used one time per document, then we probably shouldn’t continue suggesting it after that one use (or at least push it down the list). Below you can see the median number of uses of commands in documents in which they appear. Use the dropdown menu to toggle between the top 10 commands and environments!
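
    As a rough illustration of this median-based tuning, the Python sketch below computes the median number of uses per document (skipping documents where a command never appears) and uses it as a cut-off for demoting a suggestion. The function names median_uses and should_demote, the demotion rule itself, and the toy documents are all assumptions made for this example.

```python
from collections import Counter
from statistics import median


def median_uses(docs):
    """docs is a list of documents, each given as a list of command names.

    Returns {command: median count over the documents in which it appears}.
    """
    counts_per_command = {}
    for commands in docs:
        for name, count in Counter(commands).items():
            counts_per_command.setdefault(name, []).append(count)
    return {name: median(counts) for name, counts in counts_per_command.items()}


def should_demote(name, used_so_far, medians):
    """Assumed rule: demote a command once its typical usage has been reached."""
    return used_so_far >= medians.get(name, float("inf"))


# Toy corpus, invented for illustration only.
docs = [
    ["documentclass", "section", "section", "item", "item", "item"],
    ["documentclass", "section", "section", "section", "section"],
    ["documentclass", "item"],
]
medians = median_uses(docs)
print(medians)
```

    In this toy corpus documentclass has a median of one use, so it would be pushed down the list after its first appearance, while section would keep its place for a while longer.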


    Dissociating the Data

    We have a fair amount of data at this point, and while what we have seen thus far is helpful, we can do better. \(\mathrm{\LaTeX}\) documents should not solely be considered as a stream of input tokens; rather they have logical structure. We would very likely get cleaner, more representative data if we took this into account.

    \(\mathrm{\LaTeX}\) Structure

    \(\mathrm{\LaTeX}\) documents are built up from smaller pieces, namely commands and environments. Given we have already studied the global use of commands and environments, it is then important to look more closely at how these are used together.

    Preamble

    Ideally we want to suggest commands based on the context of the cursor's position within your document. An important example of context is the \(\mathrm{\LaTeX}\) document preamble: there, it is highly unlikely that you will need to use commands such as \section{…}, \chapter{…}, or many math commands. Wouldn’t it be nice if we didn’t suggest them?

    We can perform an analysis very similar to the one above to find the commands which occur in the preamble of all the documents (that is, commands that occur before \begin{document}, ignoring documents that contain no document environment).
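
    A preamble split of this kind might look like the following Python sketch. It is a simplification rather than the parser used for the analysis: a command is taken to be a backslash followed by letters (or @), comments are stripped with a basic regular expression, and the function name preamble_commands is hypothetical.

```python
import re

# A backslash followed by letters (or @): a deliberately simple notion of "command".
COMMAND = re.compile(r"\\([A-Za-z@]+)")
# An unescaped % and everything after it on the same line.
COMMENT = re.compile(r"(?<!\\)%.*")


def preamble_commands(source):
    """Return the command names found before the document environment opens.

    Returns None when the source has no document environment, so that such
    files can be ignored.
    """
    source = COMMENT.sub("", source)
    head, marker, _ = source.partition(r"\begin{document}")
    if not marker:
        return None
    return COMMAND.findall(head)


sample = r"""\documentclass{article}
\usepackage{amsmath} % \section should not be counted
\title{Demo}
\begin{document}
\section{Intro}
\end{document}
"""
print(preamble_commands(sample))
```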

    If you have ever composed a preamble and loaded some packages then it is unlikely that these results will come as a surprise. The long tail on the above plot is attributed to the fact that preambles, while sharing some structure, can vary wildly based on which packages are loaded. It is often the case that commands used in the preamble are dependent upon which packages have already been loaded—we’ll address that point in a minute.

    Environments

    Just as above we can study which commands are being used most frequently in given environments. In particular, we explore the top 10 as case studies and these can be viewed in the following plot's dropdown menu.

    And here we begin to see some real structure emerging from this data. We see much more definitive trends in the data such as the item command being used extremely heavily in list-like environments, the includegraphics command being used heavily in the figure environment, and so on. This data will allow us to provide context-sensitive autocomplete suggestions based on which document element is currently being edited—providing a much more effective and efficient editing experience.

    An important feature to note is the seemingly high frequency of begin and end commands appearing within environments. Naturally, this suggests that documents often have nested environments, which can be common in \(\mathrm{\LaTeX}\) documents depending on which environments are being used; for example, \begin{table}\begin{tabular}…. If we could understand these nesting patterns we would even be able to provide context-aware environment suggestions! Of course, our data set is finite, so we can only take this so far.
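
    One way to collect per-environment counts like these is to walk through the source while keeping a stack of open environments, attributing each command to the innermost one. The Python sketch below does this under simplifying assumptions (regex tokenisation, no handling of comments or verbatim text); the function name commands_by_environment is hypothetical. A nested \begin or \end is counted as a begin or end command of the enclosing environment, which is why those two show up inside environments.

```python
import re
from collections import Counter, defaultdict

# Either \begin{name} / \end{name}, or any other command made of letters (or @).
TOKEN = re.compile(r"\\(begin|end)\{([^}]*)\}|\\([A-Za-z@]+)")


def commands_by_environment(source):
    """Count commands per innermost enclosing environment."""
    counts = defaultdict(Counter)
    stack = []  # names of the currently open environments, outermost first
    for match in TOKEN.finditer(source):
        kind, env, command = match.groups()
        if kind == "begin":
            if stack:
                counts[stack[-1]]["begin"] += 1  # nested environment opens
            stack.append(env)
        elif kind == "end":
            if stack and stack[-1] == env:
                stack.pop()
            if stack:
                counts[stack[-1]]["end"] += 1  # nested environment closes
        elif stack:
            counts[stack[-1]][command] += 1
    return counts


sample = r"""\begin{document}
\section{Lists}
\begin{itemize}
\item one
\item two
\end{itemize}
\begin{figure}
\centering
\includegraphics{plot}
\caption{A plot}
\end{figure}
\end{document}
"""
counts = commands_by_environment(sample)
for env, counter in counts.items():
    print(env, dict(counter))
```

    Recording the stack itself, rather than only its top element, would be one possible starting point for studying nesting patterns.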

    Packages

    For future work, we will begin to explore links between which packages have been loaded and which commands are used most frequently in conjunction with those packages—to suggest commands based on the packages you have loaded.

    What's Next?

    While getting this data is one thing, implementing it is another. We've already started to improve ShareLaTeX's autocomplete (since we’ve now joined forces): now, along with suggesting commands you have already used in your document, it will suggest the top 100 most frequent commands as indicated in the analysis above!

    While we acknowledge this data set is not completely representative, it has given us a great bird's-eye view of what .tex documents look like and how people are using the language. In order to obtain data with more predictive power, we are continuing to study the structure and use of \(\mathrm{\LaTeX}\) documents. Along with this, we plan to add more corpora to our existing Overleaf Gallery such as source files from the arXiv and maybe even GitHub.


    Methodology

    In order to compute the frequencies plotted in the “What can we do to improve?” section, let's establish a bit of notation. Let the corpus, or collection of documents, be \(\mathsf{D}\) and the collection of all commands used in \(\mathsf{D}\) be \(\mathsf{C}_\mathsf{D}\). Fun fact: there are roughly 15,000 unique commands used throughout this corpus and over 900,000 total command uses! For each command \(\mathsf{c}\) in \(\mathsf{C}_\mathsf{D}\), we can calculate its local frequency with respect to each document \(\mathsf{d}\in\mathsf{D}\) as the simple ratio

    \[f_\mathsf{c,d} = \frac{n_\mathsf{c}}{N_\mathsf{d}}\]

    where \(n_\mathsf{c}\) is the number of times the command \(\mathsf{c}\) appears in document \(\mathsf{d}\), and \(N_\mathsf{d}\) is the total number of command uses in the document \(\mathsf{d}\). Note that \(f_\mathsf{c,d}\) is 0 whenever command \(\mathsf{c}\) does not appear in document \(\mathsf{d}\), which is the case for many commands. We can now calculate the global frequency of each command in the given corpus by averaging all local frequencies:

    \[f_\mathsf{c} = \frac{1}{|\mathsf{D}|} \sum_{\mathsf{d}\in\mathsf{D}} f_{\mathsf{c,d}}\]

    where \(|\mathsf{D}|\) is the number of documents in the corpus. This method of calculating frequencies weights commands not only by how many uses they have, but also by how many documents we find them in. This gives an effective measure of how widely commands permeate a range of documents, and it is what you see plotted above in the lighter green.
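
    The two formulas translate almost directly into code. The Python sketch below is a minimal illustration, assuming each document has already been reduced to a list of command names; the function names and the toy corpus are invented, and empty documents are assumed to contribute 0 while still counting towards \(|\mathsf{D}|\).

```python
from collections import Counter


def local_frequencies(commands):
    """f_{c,d}: share of one document's command uses taken up by each command."""
    total = len(commands)
    return {name: n / total for name, n in Counter(commands).items()}


def global_frequencies(docs):
    """f_c: local frequencies averaged over every document in the corpus."""
    sums = Counter()
    for commands in docs:
        if not commands:
            continue  # assumption: an empty document contributes 0 to every sum
        for name, f in local_frequencies(commands).items():
            sums[name] += f
    return {name: s / len(docs) for name, s in sums.items()}


# Toy corpus, invented for illustration: one document uses pgf very heavily.
docs = [
    ["pgf"] * 98 + ["section", "cite"],
    ["section", "cite"],
    ["section", "section", "cite", "cite"],
]
raw_counts = Counter(name for commands in docs for name in commands)
freq = global_frequencies(docs)
print(raw_counts.most_common(1))
print(sorted(freq, key=freq.get, reverse=True))
```

    In this toy corpus pgf wins on raw counts, but section edges ahead once each document is given equal weight, which is the effect the weighting is meant to achieve.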

    With this information alone we can rank commands based on how often they are used. Once we have this data, it is also interesting to ask how often the most used commands are used in the documents in which they do appear. So we define a modified frequency \(\tilde{f}_{\!\mathsf{c}}\) that depends on the set \(\mathsf{D}_\mathsf{c}\), which consists of all documents \(\mathsf{d}\) such that command \(\mathsf{c}\) is found in \(\mathsf{d}\) (that is, \(\mathsf{D}_\mathsf{c} = \{\mathsf{d}\in\mathsf{D}\,|\,\mathsf{c}\in\mathsf{d}\}\)):

    \[\tilde{f}_{\!\mathsf{c}} = \frac{1}{|\mathsf{D}_\mathsf{c}|} \sum_{\mathsf{d}\in\mathsf{D}_\mathsf{c}} f_{\mathsf{c,d}}\]

    This quantity expresses more features about how commands are being used in their respective documents, rather than taking a corpus view. In the plots above, this is represented with the darker shade of green. Note the plots are sorted by their corpus, or global, frequency.
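
    The modified frequency differs from the global one only in its denominator: each command is averaged over the documents that contain it rather than over the whole corpus. A minimal Python sketch, again assuming documents are lists of command names and using an invented toy corpus and a hypothetical function name:

```python
from collections import Counter


def conditional_frequencies(docs):
    """Average local frequency of each command over the documents containing it."""
    sums = Counter()        # sum of f_{c,d} over the documents in D_c
    containing = Counter()  # |D_c|: number of documents in which c appears
    for commands in docs:
        total = len(commands)
        for name, n in Counter(commands).items():
            sums[name] += n / total
            containing[name] += 1
    return {name: sums[name] / containing[name] for name in sums}


# Toy corpus, invented for illustration only.
docs = [
    ["pgf", "pgf", "pgf", "section"],
    ["section", "cite"],
]
cond = conditional_frequencies(docs)
print(cond)
```

    Here pgf appears in only one of the two documents but makes up three quarters of that document's commands, so its modified frequency is high even though its global frequency would be halved.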