Implementing re-entrant parsers in Bison and Flex

In days of yore, Yacc and Lex were two of the most useful tools in a Unix hacker’s kit. The way they interfaced to client code was, however, pretty ugly – global variables and magic macros hanging out all over the place. Their modern descendants, Bison and Flex, have preserved that ugliness in order to be backward-compatible.

That rebarbative old interface generally broke a lot of rules about program structure and information hiding that we now accept as givens (to be fair, most of those had barely been invented when it was written in the early 1970s, and were still pretty novel). It becomes a particular problem if you want to run multiple instances of your generated parser (or, heaven forfend, multiple parsers with different grammars) in the same binary without having them interfere with each other.

But it can be done. I’m going to describe how because (a) it’s difficult to extract from the documentation, and (b) right now (that is, using Bison 3.0.2 and Flex 2.5.35) the interface is in fact slightly broken and there’s a workaround you need to know.

First, motivation. I had to figure out how to do this for cvs-fast-export, which contains a Bison/Flex grammar that parses CVS master files – often thousands of them in a single run. In the never-ending quest for faster performance (because there are some very old, very gnarly, and very large CVS repositories out there that we would nevertheless like to be able to convert without grinding at them for days or weeks) I have recently been trying to parallelize the parsing stage. The goal is (a) to be able to spread the job across multiple processors so the work gets done faster, and (b) not to allow I/O waits for some masters being parsed to block compute-intensive operations on others.

In order to make this happen, cvs-fast-export has to manage a bunch of worker threads with a parser instance inside each one. And in order for that to work, the yyparse() and yylex() driver functions in the generated code have to be reentrant. No globals allowed; they have to keep their parsing and lexing state in purely stack- and thread-local storage, and deliver their results back the way ordinary reentrant C functions would do it (that is, through structures referenced by pointer arguments).
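To make the requirement concrete, here is a minimal sketch in plain C of what "no globals, state passed by pointer" looks like. It is not Flex output; toy_state and toy_next_number() are hypothetical names invented for illustration. The point is only that two instances can be advanced in any order without disturbing each other.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical toy scanner state: everything the scanner needs lives here,
 * not in globals, so each caller (or thread) can own a separate instance. */
typedef struct {
    const char *cursor;   /* next unread character */
    int tokens_seen;      /* running count for this instance only */
} toy_state;

/* Reads the next run of digits as an integer into *value_out.
 * Returns 1 if a number was found, 0 at end of input. */
int toy_next_number(toy_state *st, int *value_out)
{
    assert(st != NULL && value_out != NULL);
    /* Skip anything that is not a digit. */
    while (*st->cursor && !isdigit((unsigned char)*st->cursor))
        st->cursor++;
    if (*st->cursor == '\0')
        return 0;
    int v = 0;
    while (isdigit((unsigned char)*st->cursor))
        v = v * 10 + (*st->cursor++ - '0');
    *value_out = v;           /* result goes out through the pointer */
    st->tokens_seen++;
    return 1;
}
```

A caller would declare one toy_state per input (or per worker thread) and pass its address on every call; nothing is shared between instances.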

Stock Yacc and Lex couldn’t do this. A very long time ago I wrote a workaround – a tool that would hack the code they generated to encapsulate it. That hack is obsolete because (a) nobody uses those heirloom versions any more, and (b) Bison/Flex have built-in support for this. If you read the docs carefully. And it’s partly broken.

Here’s how you start. In your Bison grammar, you need to include something that begins with these options:

%define api.pure full
%lex-param {yyscan_t scanner}
%parse-param {yyscan_t scanner}

Here, yyscan_t is (in effect) a special private structure used to hold your scanner state. (That’s a slight fib, which I’ll rectify later; it will do for now.)

And your Flex specification must contain these options:

%option reentrant bison-bridge

These are the basics required to make your parser re-entrant. The signatures of the parser and lexer driver functions change from yyparse() and yylex() (no arguments) to these:

yyparse(yyscan_t scanner)
yylex(YYSTYPE *yylval_param, yyscan_t yyscanner)

As noted above, a yyscan_t holds your scanner state; yylval_param points to the place where yylex() will put its token value when it’s called by yyparse().

You may be puzzled by the fact that the %lex-param declaration says ‘scanner’ but the scanner state argument ends up being ‘yyscanner’. That’s reasonable; I’m a bit puzzled by it myself. In the generated scanner code, if there is a scanner-state argument (forced by %option reentrant) it is always the first one and it is always named yyscanner regardless of what the first %lex-param declaration says – that first declaration seems to be a placeholder. In contrast, the first argument name in the %parse-param declaration actually gets used as is.

You must call yyparse() like this:

    yyscan_t myscanner;

    yylex_init(&myscanner);
    yyparse(myscanner);
    yylex_destroy(myscanner);

The yylex_init() function call sets myscanner to hold the address of a private malloced block holding scanner state; that’s why you have to destroy it explicitly.

The old-style global variables, like yyin, become macros that reference members of the yyscan_t structure; for a more modern look you can use accessor functions instead. For yyin that is the pair yyget_in() and yyset_in().
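The init/accessor/destroy pattern described above can be sketched in plain C as follows. This mimics the shape of the interface only; it is not Flex's generated code, and every toy_ name is hypothetical. The accessor argument order (value first, handle last) is chosen to match yyset_in().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names throughout: this imitates the API shape, nothing more. */
typedef void *toy_scan_t;          /* opaque handle, in the role of yyscan_t */

struct toy_guts {                  /* private state the caller never sees */
    int lineno;
};

/* Allocates the private block and stores its address through handle_out.
 * Returns 0 on success, nonzero on failure. */
int toy_init(toy_scan_t *handle_out)
{
    assert(handle_out != NULL);
    struct toy_guts *g = malloc(sizeof *g);
    if (g == NULL)
        return 1;
    g->lineno = 1;
    *handle_out = g;
    return 0;
}

/* Accessor pair in the style of yyget_lineno()/yyset_lineno(). */
int toy_get_lineno(toy_scan_t handle)
{
    return ((struct toy_guts *)handle)->lineno;
}

void toy_set_lineno(int lineno, toy_scan_t handle)
{
    ((struct toy_guts *)handle)->lineno = lineno;
}

/* Frees the private block; the handle must not be used afterwards. */
int toy_destroy(toy_scan_t handle)
{
    free(handle);
    return 0;
}
```

Because the caller only ever holds an opaque pointer, the layout of the private block can change without touching client code, which is the information hiding the old global-variable interface lacked.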

I hear your question coming. “But, Eric! How do I pass stuff out of the parser?” Good question. My Bison declarations actually look like this:

%define api.pure full
%lex-param {yyscan_t scanner} {cvs_file *cvs}
%parse-param {yyscan_t scanner} {cvs_file *cvs}

Notice there’s an additional argument. This should change the function signatures to look like this:

yyparse(yyscan_t scanner, cvs_file *cvs)
yylex(YYSTYPE *yylval_param, yyscan_t yyscanner, cvs_file *cvs)

Now you will call yyparse() like this:

    yyscan_t myscanner;

    yylex_init(&myscanner);
    yyparse(myscanner, mycvs);
    yylex_destroy(myscanner);

Et voilà! The cvs argument will now be visible to the handler functions you write in your Bison grammar and your Flex specification. You can use it to pass data to the caller – typically, as in my code, the type of the second argument will be a structure of some sort. The documentation insists you can add further curly-brace-wrapped argument declarations to %lex-param and %parse-param to declare more than one extra argument; I have not tested this.
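As a rough illustration of passing results out through an extra argument, here is a plain-C sketch with the same calling shape as yyparse(scanner, cvs). It involves no Bison or Flex; demo_result and demo_parse() are hypothetical stand-ins, and the "grammar" merely counts words and occurrences of the word "rev".

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result record, standing in for something like cvs_file. */
typedef struct {
    int revisions;       /* how many "rev" words were seen */
    int words;           /* total words seen */
} demo_result;

/* Hypothetical driver: the first argument is the input state, the second
 * is where results go. Returns 0 on success, as yyparse() does. */
int demo_parse(const char **cursor, demo_result *out)
{
    assert(cursor != NULL && *cursor != NULL && out != NULL);
    const char *p = *cursor;
    while (*p) {
        while (*p == ' ')
            p++;                          /* skip separators */
        if (*p == '\0')
            break;
        size_t len = strcspn(p, " ");     /* length of the next word */
        out->words++;
        if (len == 3 && strncmp(p, "rev", 3) == 0)
            out->revisions++;
        p += len;
    }
    *cursor = p;                          /* report how far we got */
    return 0;
}
```

Each caller supplies its own result structure, so two parses running side by side never write to the same place.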

Anyway, the ‘yyscanner’ argument will be visible to the helper code in your Flex specification. The ‘scanner’ argument will be visible, under its right name, in your Bison grammar’s handler code. In both cases it is useful as the argument to accessors like yyget_in() and yyget_lineno(). The cvs argument, as noted before, will be visible in both places.

There is, however, one gotcha (and yes, I have filed a bug report about it). Bison should arrange things so that all the %lex-param information is automatically passed to the generated parser and scanner code via the header file Bison generates (which is typically included in the C preambles to your Bison grammar and Flex specification). But it does not.

You have to work around this, until it’s fixed, by defining a YY_DECL macro that expands to the correct prototype, in a header that is #included by both generated source code files. When those files are expanded by the C preprocessor, the payload of YY_DECL will be put in the correct places.

Mine, which corresponds to the second set of declarations above, looks like this:

#define YY_DECL int yylex \
	(YYSTYPE * yylval_param, yyscan_t yyscanner, cvs_file *cvs)
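For readers unfamiliar with the trick, this plain-C sketch shows how one macro can serve as both the declaration and the definition header of a function, which is the role YY_DECL plays. PROTO_DECL and proto_lex() are hypothetical names, and the body is a trivial one-character-per-token scanner rather than anything Flex would generate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro in the role of YY_DECL: one spelling of the
 * prototype, reused for both the declaration and the definition. */
#define PROTO_DECL int proto_lex \
	(int *value_out, const char **cursor, int *calls)

/* Where a shared header would declare the function for its callers: */
PROTO_DECL;

/* Where the scanner source would define it: */
PROTO_DECL
{
    assert(value_out != NULL && cursor != NULL && calls != NULL);
    (*calls)++;
    if (**cursor == '\0')
        return 0;               /* 0 means end of input, as with yylex() */
    *value_out = (unsigned char)**cursor;
    (*cursor)++;
    return 1;
}
```

Since the prototype is written exactly once, the caller's view and the definition cannot drift apart when an argument is added.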

There you have it. Reentrancy, proper information hiding – it’s not yer father’s parser generator. For the fully worked example, see the following files in the cvs-fast-export sources: gram.y, lex.l, cvs.h, and import.c.

Mention should be made of “Make a reentrant parser with Flex and Bison”, another page on this topic. The author describes a different technique requiring an uglier macro hack. I wasn’t able to make it work, but it started me looking in approximately the right direction.