Writing Tree-Sitter Grammars in BNF

If you keep up with my blog, you might remember a tutorial series on using Tree-sitter. At that time, I expressed my frustration on using it to write grammars, as they are expressed in JavaScript. But I ended up learning it, and for that, I invested some time in the tutorial. The tutorial example was a small conversion script, from BNF to Tree-sitter, so that I could write my grammars in the BNF syntax, instead of using JavaScript.

Tree-sitter is a fantastic parser generator. It is incremental, error-tolerant, and produces concrete syntax trees that are used by editors like Neovim and Helix for syntax highlighting, folding, and more. The lexer is context aware, and the constructed parsers, written in C, are quite fast. It has bindings for most programming languages, which makes it easier to reuse a grammar for different tasks, using different languages.

Tree-sitter grammars are written in JavaScript. Not a subset of JavaScript, not a configuration format — actual JavaScript. It can be seen as a DSL (Domain Specific Language), that uses a set of well defined functions: seq(), choice(), repeat(), optional(), token(), and so on.

It you did not read my tutorial series on Tree-sitter, here is fragment of a real grammar describing how a rule body is structured:

JavaScript
ruleBody:   $ => seq($.symbolSeq, repeat(seq('|', $.symbolSeq))),
symbolSeq:  $ => seq(repeat1($.symbol), optional(field('prec', $.precAnnotation))),
symbol:     $ => seq(
  optional(field('label', $.fieldLabel)),
  field('content', $._symbolContent),
  optional(field('kleene', $._kleeneOp))
),
_symbolContent: $ => choice(
  $.nonTerminal, $._terminal, $.subSeq,
  $.aliasGroup, $.tokenExpr, $.tokenImmediateExpr, $.precGroup
),
JavaScript

It works. It is expressive. But it is not easy to read at a glance — the syntactic structure of the language you are describing is buried under a layer of JavaScript boilerplate. For me, at least, used to BNF for some years, it takes conscious effort to see the grammar through the noise.

The BNF alternative

From the code used in the tutorial series, I built ts-bnf-tool to scratch that itch. It lets me write my grammar in a compact BNF dialect and generates the equivalent grammar.js.

The same fragment above, in BNF:

Plaintext
ruleBody   -> symbolSeq ('|' symbolSeq)* ;
symbolSeq  -> symbol+ precAnnotation? ;
symbol     -> fieldLabel? _symbolContent _kleeneOp? ;
_symbolContent -> nonTerminal | _terminal | subSeq
               | aliasGroup | tokenExpr | tokenImmediateExpr | precGroup ;
Plaintext

If you learned parsers and compilers generators at school, this syntax should be clean and transparent to you, making the structure of the language clear.

The syntax

The dialect follows familiar BNF conventions with a few extensions for Tree-sitter features:

ConstructSyntaxExample
Rulename -> body ;expr -> term '+' term ;
Comment# text# arithmetic
Literal terminal'text' or "text"'->'
Pattern terminal/regex//[0-9]+/
Non-terminal referencebare identifierterm
Sequencejuxtaposition'(' expr ')'
Alternative|expr | term
Zero or more*term*
One or more+term+
Optional?','?
Grouping( )('a' | 'b')+
Token expression<< >><< /[a-z]/ /[a-z0-9]*/ >>
Token immediate<<! >>'-' <<! /[0-9]+/ >>
Field labelname: symbollhs: expr
Alias group(body => name)(a b => pair)
Precedence%prec.left Nexpr '+' expr %prec.left 1
Conflicts%conflicts [r1, r2]%conflicts [expr, term]
Inline%inline r%inline _helper
Supertypes%supertypes r%supertypes expression
Extras%extras item%extras /\s/, comment

The << >> notation maps to tree-sitter’s token() — it forces the enclosed expression to be considered as single indivisible token, with no whitespaces allowed between its parts. <<! >> maps to token.immediate(), which additionally requires that no whitespaces precedes the token — useful for
things like a sign that must be adjacent to its number.

The mapping

Running ts-bnf-tool file.bnf prints a complete grammar.js ready to use in tree-sitter. Each BNF construct maps directly to its Tree-sitter counterpart:

BNFtree-sitter JS
a b cseq($.a, $.b, $.c)
a | bchoice($.a, $.b)
a*repeat($.a)
a+repeat1($.a)
a?optional($.a)
<< body >>token(body)
<<! body >>token.immediate(body)
name: symbolfield('name', ...)
(body => alias)alias(body, ...)
body %prec.left Nprec.left(N, body)

There is no magic and no ambiguity: the BNF is a thin syntactic layer over the tree-sitter DSL, or as I like to call, its syntax sugar.

Eating our own cooking

The tool parses BNF using a Tree-sitter grammar — itself written in JavaScript, of course. But the repository also ships grammar/bnf.bnf, which is the grammar of the BNF dialect expressed in itself. It serves as the canonical human-readable specification of what the tool accepts.

Here is a representative excerpt, describing how the tool’s own directives and token constructs are parsed:

Plaintext
%extras    /\s/, comment
%conflicts [symbolSeq, symbolSeqInner]
%inline    _directive

grammar    -> (rule | _directive)+ ;
_directive -> conflictsDirective | inlineDirective
           | supertypesDirective | extrasDirective ;

conflictsDirective -> '%conflicts' conflictGroup (',' conflictGroup)* ;
conflictGroup      -> '[' nonTerminal (',' nonTerminal)* ']' ;

tokenExpr          -> '<<' ruleBody '>>' ;
tokenImmediateExpr -> '<<!' ruleBody '>>' ;
literal            -> << /'([^'\\]|\\.)*'/ | /"([^"\\]|\\.)*"/ >> ;
comment            -> << '#' /.*/ >> ;
Plaintext

Running ts-bnf-tool grammar/bnf.bnf produces output that matches grammar.js rule-for-rule.

Using it

ShellScript
# Install from source
git clone https://github.com/ambs/tree-sitter-bnf-tools
cd tree-sitter-bnf-tools
make release          # binary at target/release/ts-bnf-tool

# Preview the generated grammar.js
ts-bnf-tool my-language.bnf

# Generate a complete tree-sitter project
ts-bnf-tool --generate my-language.bnf
ShellScript

The --generate flag writes grammar.js into a new directory and runs tree-sitter generate for you, producing src/parser.c in one step.


Tree-sitter grammars are not going to stop being JavaScript. But for me, being able to draft and read them in BNF and generate the boilerplate makes the whole process considerably more pleasant. If you feel the same way, the repository is at https://github.com/ambs/tree-sitter-bnf-tools. There are some tasks open, already, but I am open for suggestions. I should also release it sooner or later in crates.io.

This post was bootstrapped using AI.

Leave a Reply