Stata's Best Document Publishing System, Texdoc
Mar 27, 2022
For better or worse, Stata is the default data analysis tool in economics.1 Its success is built on broad support for statistical models, easy syntax, and a straightforward workflow. For economics coursework, Stata excels.
Stata’s Achilles’ heel is its inflexibility. If StataCorp does not deem a use case worthy, it is at best inconvenient to bend the language to your needs. At worst, it is impossible. As a result, Stata is slow to adopt features that are standard in modern languages.
A glaring omission is a notebook-style document publication tool. RMarkdown and Jupyter Notebook, two community-built tools for R and Python, have grown massively in the past five years. It is difficult to imagine developing a modern, data science–focused programming language without a robust system for document publication. 2
This is an important use case and one I encounter frequently as an undergraduate writing problem sets and short papers. Of course, many languages, including Stata, can output figures and tables into their own .tex files, but the overhead of file management and custom tables, plus the lack of easy code and output formatting, makes that solution not worthwhile.
The workarounds for this problem are tedious and would make any data scientist shiver. There’s the good: exporting tables to LaTeX using estout or outreg2; the bad: copying and pasting the text output into Word using a fixed-width font; and the ugly: screenshotting the Stata output window and inserting the picture into Word.
I have yet to see any professor use Texdoc, a LaTeX-integrated, notebook publishing package for Stata written by Ben Jann, the creator of estout. This post introduces texdoc, its workflow, and some experiences of my own with the package.
Texdoc Basics
After installation, texdoc works out of a normal do-file. In addition to outputting results to the console, it produces a .tex file that can be compiled with any LaTeX compiler.
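For reference, installation is a one-time step from Stata's console. This is a minimal sketch, assuming an internet connection and access to the SSC archive. The sjlatex line, which fetches the package that ships stata.sty, is written from my memory of the texdoc documentation, so treat it as an assumption and check it against the current instructions.

```stata
* Install texdoc from the SSC archive (one-time step)
ssc install texdoc, replace

* Assumption: stata.sty is distributed with the Stata Journal's sjlatex package.
* After installing it, see -help sjlatex- for how to copy the style files
* into the folder that holds your LaTeX document.
net install sjlatex, from(http://www.stata-journal.com/production)
```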
Any given line will be one of three options:
- Stata Code/Comments: texdoc does not interrupt Stata code and comments, so files meant for texdoc can still be executed just like normal, producing the same output. 3
- texdoc Code: Lines beginning with texdoc tell the program what to print, what to execute in the background, how to display graphs, and more.
- LaTeX Code: Code marked off by /*** and ***/ will be inserted as LaTeX code into the .tex file. 4
Example
I’ll demonstrate with the auto dataset. Here’s a trivial do-file:
sysuse auto
sum mpg
hist mpg
reg mpg weight
First, texdoc must be initialized. Prepend the do-file with the line below. When the do-file is run through texdoc, this line tells it to create a .tex file.
texdoc init auto-analysis.tex, replace
texdoc outputs a LaTeX file. For short documents, I keep all of the LaTeX code in the do-file for easy editing.
After that line, LaTeX will need a preamble:
/***
\documentclass{article}
\usepackage{amsmath, amssymb}
\usepackage{stata}
\usepackage{graphicx}
\title{Texdoc Auto Analysis Example}
\author{Luke DiMartino}
\begin{document}
\maketitle
***/
The stata package is required for the texdoc sections to be compiled, and the graphicx package is required to include graphs.
And append the closing LaTeX code to the end of the document.
/***
\end{document}
***/
I have already named this analysis and will be discussing the data in the document, so nobody needs to see the first line of Stata code loading the data. I would, however, like to display the summary of the mpg variable and the histogram.
To display code and its output, surround it with texdoc stlog and texdoc stlog close.
texdoc stlog
sum mpg
texdoc stlog close
Next, I want to display the histogram. To display a graph, put the graph command inside the stlog section, then use texdoc graph, optargs(width=0.6\textwidth) after texdoc stlog close. 5
sysuse auto
texdoc stlog
sum mpg
hist mpg
texdoc stlog close
texdoc graph, optargs(width=0.6\textwidth)
Next, I’ll add the regression along with a model equation:
texdoc stlog
reg mpg weight
texdoc stlog close
/***
The regression equation is:
\begin{equation}
\text{MPG}_i = \beta_0 + \beta_1 \text{Weight}_i + \varepsilon_i.
\end{equation}
***/
And that’s it! Call texdoc do your_do_file.do from Stata’s console and the .tex file will be built. Each stlog section and graph is written to its own file, and texdoc includes each of them in the main file. Make sure stata.sty is in your working directory, then compile the finished document with your LaTeX editor or compiler.
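The build step can be sketched as two console commands. The do-file name auto-analysis.do is hypothetical, and the second command assumes pdflatex is installed and on the system PATH; compiling in your usual LaTeX editor works just as well.

```stata
* Run the do-file through texdoc to build auto-analysis.tex
* (auto-analysis.do is a hypothetical name for the example do-file)
texdoc do auto-analysis.do

* Optional: compile without leaving Stata by shelling out.
* Assumption: pdflatex is installed and on the system PATH.
!pdflatex auto-analysis.tex
```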
Script
Here is the full script, only missing the document preamble:
texdoc init auto-analysis.tex, replace
/***
--document preamble--
***/
sysuse auto
texdoc stlog
sum mpg
hist mpg
texdoc stlog close
texdoc graph, optargs(width=0.6\textwidth)
texdoc stlog
reg mpg weight
texdoc stlog close
/***
The regression equation is:
\begin{equation}
\text{MPG}_i = \beta_0 + \beta_1 \text{Weight}_i + \varepsilon_i.
\end{equation}
\end{document}
***/
Document Management
Of course, you may edit the .tex file with your LaTeX editor after texdoc builds it, but I prefer editing everything, including the LaTeX code, in the do-file. Every call to texdoc do overwrites the .tex file it creates, so any changes made directly in that file are lost.
For longer documents (or group work), let texdoc compile to sections of LaTeX code. Produce hw1_1.tex and hw1_2.tex, then create a main LaTeX file that holds the preamble and combines the sections with \input{hw1_1.tex} and \input{hw1_2.tex}.
Those files could be texdoc output or hand-written LaTeX. For longer papers, I keep a spare document of work that I can copy and paste from so that I can use my preferred LaTeX editor (Vim) for typesetting.
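A section do-file might look like the sketch below. The section heading and the choice of commands are illustrative assumptions; the point is only that the file has no preamble and no \begin{document} or \end{document}, since the main LaTeX file supplies those.

```stata
* hw1_1.do: builds a single section, so there is no preamble here
texdoc init hw1_1.tex, replace

/***
\section*{Problem 1}
***/

sysuse auto, clear
texdoc stlog
sum mpg
texdoc stlog close
```

Running texdoc do hw1_1.do (and the same for hw1_2.do) rebuilds each section separately. The main file still needs \usepackage{stata} in its preamble for the logged output to compile.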
Cleaning Ugly Output
Stata’s output is occasionally quite ugly. For example, it prints a line for every iteration of log-likelihood optimization when fitting ARIMA models. Instead of logging that output, use outreg2, esttab, or the regression table package of your choice; output the table to a .tex file; then add this below the table command in the do-file so that LaTeX pulls the table in.
/***
\input{this_reg_table.tex}
***/
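As a sketch of the table step, here is one way to write that file with esttab, using the auto regression from earlier rather than an ARIMA model. It assumes the estout package is installed (ssc install estout); the se and label options are just one reasonable choice of formatting. These lines go above the LaTeX block in the do-file and outside any stlog section.

```stata
* Assumption: estout is installed (ssc install estout)
sysuse auto, clear
eststo clear

* Fit the model quietly so nothing is printed, and store the estimates
eststo: quietly reg mpg weight

* Write the stored model as a LaTeX table to the file name used above
esttab using this_reg_table.tex, replace se label
```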
1. I cringe at the mention of Stata on the same site as data science, but here we are.
2. Case in point: Julia, the new kid on the block of high-level, data science–focused programming languages, has IJulia to integrate into Jupyter Notebook.
3. This is an advantage over dyndoc, another Stata publishing format that requires writing HTML in the do-file. That code prevents the user from executing the file normally to see code results like regression outputs.
4. Of course, since Stata reads these as comments, they are ignored when executing the file normally.
5. There are other graphing options that are more complex.