If you keep up with my blog, you might remember a tutorial series on using Tree-sitter. At that time, I expressed my frustration on using it to write grammars, as they are expressed in JavaScript. But I ended up learning it, and for that, I invested some time in the tutorial. The tutorial example was a small conversion script, from BNF to Tree-sitter, so that I could write my grammars in the BNF syntax, instead of using JavaScript.
Tree-sitter is a fantastic parser generator. It is incremental, error-tolerant, and produces concrete syntax trees that are used by editors like Neovim and Helix for syntax highlighting, folding, and more. The lexer is context aware, and the constructed parsers, written in C, are quite fast. It has bindings for most programming languages, which makes it easier to reuse a grammar for different tasks, using different languages.
Tree-sitter grammars are written in JavaScript. Not a subset of JavaScript, not a configuration format — actual JavaScript. It can be seen as a DSL (Domain Specific Language), that uses a set of well defined functions: seq(), choice(), repeat(), optional(), token(), and so on.
It you did not read my tutorial series on Tree-sitter, here is fragment of a real grammar describing how a rule body is structured:
ruleBody: $ => seq($.symbolSeq, repeat(seq('|', $.symbolSeq))),
symbolSeq: $ => seq(repeat1($.symbol), optional(field('prec', $.precAnnotation))),
symbol: $ => seq(
optional(field('label', $.fieldLabel)),
field('content', $._symbolContent),
optional(field('kleene', $._kleeneOp))
),
_symbolContent: $ => choice(
$.nonTerminal, $._terminal, $.subSeq,
$.aliasGroup, $.tokenExpr, $.tokenImmediateExpr, $.precGroup
),JavaScriptIt works. It is expressive. But it is not easy to read at a glance — the syntactic structure of the language you are describing is buried under a layer of JavaScript boilerplate. For me, at least, used to BNF for some years, it takes conscious effort to see the grammar through the noise.
The BNF alternative
From the code used in the tutorial series, I built ts-bnf-tool to scratch that itch. It lets me write my grammar in a compact BNF dialect and generates the equivalent grammar.js.
The same fragment above, in BNF:
ruleBody -> symbolSeq ('|' symbolSeq)* ;
symbolSeq -> symbol+ precAnnotation? ;
symbol -> fieldLabel? _symbolContent _kleeneOp? ;
_symbolContent -> nonTerminal | _terminal | subSeq
| aliasGroup | tokenExpr | tokenImmediateExpr | precGroup ;PlaintextIf you learned parsers and compilers generators at school, this syntax should be clean and transparent to you, making the structure of the language clear.
The syntax
The dialect follows familiar BNF conventions with a few extensions for Tree-sitter features:
| Construct | Syntax | Example |
|---|---|---|
| Rule | name -> body ; | expr -> term '+' term ; |
| Comment | # text | # arithmetic |
| Literal terminal | 'text' or "text" | '->' |
| Pattern terminal | /regex/ | /[0-9]+/ |
| Non-terminal reference | bare identifier | term |
| Sequence | juxtaposition | '(' expr ')' |
| Alternative | | | expr | term |
| Zero or more | * | term* |
| One or more | + | term+ |
| Optional | ? | ','? |
| Grouping | ( ) | ('a' | 'b')+ |
| Token expression | << >> | << /[a-z]/ /[a-z0-9]*/ >> |
| Token immediate | <<! >> | '-' <<! /[0-9]+/ >> |
| Field label | name: symbol | lhs: expr |
| Alias group | (body => name) | (a b => pair) |
| Precedence | %prec.left N | expr '+' expr %prec.left 1 |
| Conflicts | %conflicts [r1, r2] | %conflicts [expr, term] |
| Inline | %inline r | %inline _helper |
| Supertypes | %supertypes r | %supertypes expression |
| Extras | %extras item | %extras /\s/, comment |
The << >> notation maps to tree-sitter’s token() — it forces the enclosed expression to be considered as single indivisible token, with no whitespaces allowed between its parts. <<! >> maps to token.immediate(), which additionally requires that no whitespaces precedes the token — useful for
things like a sign that must be adjacent to its number.
The mapping
Running ts-bnf-tool file.bnf prints a complete grammar.js ready to use in tree-sitter. Each BNF construct maps directly to its Tree-sitter counterpart:
| BNF | tree-sitter JS |
|---|---|
a b c | seq($.a, $.b, $.c) |
a | b | choice($.a, $.b) |
a* | repeat($.a) |
a+ | repeat1($.a) |
a? | optional($.a) |
<< body >> | token(body) |
<<! body >> | token.immediate(body) |
name: symbol | field('name', ...) |
(body => alias) | alias(body, ...) |
body %prec.left N | prec.left(N, body) |
There is no magic and no ambiguity: the BNF is a thin syntactic layer over the tree-sitter DSL, or as I like to call, its syntax sugar.
Eating our own cooking
The tool parses BNF using a Tree-sitter grammar — itself written in JavaScript, of course. But the repository also ships grammar/bnf.bnf, which is the grammar of the BNF dialect expressed in itself. It serves as the canonical human-readable specification of what the tool accepts.
Here is a representative excerpt, describing how the tool’s own directives and token constructs are parsed:
%extras /\s/, comment
%conflicts [symbolSeq, symbolSeqInner]
%inline _directive
grammar -> (rule | _directive)+ ;
_directive -> conflictsDirective | inlineDirective
| supertypesDirective | extrasDirective ;
conflictsDirective -> '%conflicts' conflictGroup (',' conflictGroup)* ;
conflictGroup -> '[' nonTerminal (',' nonTerminal)* ']' ;
tokenExpr -> '<<' ruleBody '>>' ;
tokenImmediateExpr -> '<<!' ruleBody '>>' ;
literal -> << /'([^'\\]|\\.)*'/ | /"([^"\\]|\\.)*"/ >> ;
comment -> << '#' /.*/ >> ;PlaintextRunning ts-bnf-tool grammar/bnf.bnf produces output that matches grammar.js rule-for-rule.
Using it
# Install from source
git clone https://github.com/ambs/tree-sitter-bnf-tools
cd tree-sitter-bnf-tools
make release # binary at target/release/ts-bnf-tool
# Preview the generated grammar.js
ts-bnf-tool my-language.bnf
# Generate a complete tree-sitter project
ts-bnf-tool --generate my-language.bnfShellScriptThe --generate flag writes grammar.js into a new directory and runs tree-sitter generate for you, producing src/parser.c in one step.
Tree-sitter grammars are not going to stop being JavaScript. But for me, being able to draft and read them in BNF and generate the boilerplate makes the whole process considerably more pleasant. If you feel the same way, the repository is at https://github.com/ambs/tree-sitter-bnf-tools. There are some tasks open, already, but I am open for suggestions. I should also release it sooner or later in crates.io.
This post was bootstrapped using AI.