As seen in previous posts, I’ve been using AntLR for my parsers. Nevertheless, I’ve been listening to discussions about the speed of Tree-Sitter parsers, and I decided to give it a try. While I learn how to use them, I will write a brief tutorial on the subject. Bear with me if you find any inaccuracies. Just drop me a line and I will fix them.
Tree-Sitter allows generating parsers in C, Go, JavaScript, Rust, Python and more. Basically, the parser is generated in pure C code, and then bindings for different languages are able to use it. Thus, the process has two parts: creating the parser and then using it. For this first part, we will only create the parser.
Installing Dependencies
To run tree-sitter we need NodeJS and to use the tree-sitter-cli tool we need Rust. Installing them depends on your operating system, so I will not go into those details. After installing these two languages, I installed tree-sitter-cli
with the following line:
cargo install tree-sitter-cli --locked
BashThere are other ways to install the tree-sitter-cli tool: .
Project Bootstrapping
For this tutorial, we will create a parser for BNF. The preferred convention is to name the parser repository as tree-sitter-
followed by the name of the language in lowercase. Thus, we will create a folder named tree-sitter-bnf
. Then, run the tree-sitter-cli
bootstrapper to create the boilerplate code.
mkdir tree-sitter-bnf
cd tree-sitter-bnf
# Run bootstrapper
tree-sitter init
BashThe CLI tool will ask a couple of questions. Follows the answers I gave. Some of them I am not yet sure for what they are used. Anyway, I gave an answer…
ambs@AlbertoS-LT:~/tree-sitter-bnf$ tree-sitter init
✔ Parser name · bnf
✔ CamelCase name · Bnf
✔ Title (human-readable name) · Bnf
✔ Description · A simple BNF syntax parser
✔ Repository URL · https://github.com/tree-sitter/tree-sitter-bnf
✔ Funding URL ·
✔ TextMate scope · source.bnf
✔ File types (space-separated) · bnf
✔ Version · 0.1.0
✔ License · MIT
✔ Author name · Alberto Simões
✔ Author email · ambs@cpan.org
✔ Author URL · https://ambs.zbr.pt
Your current configuration:
{
"name": "bnf",
"camelcase": "Bnf",
"title": "Bnf",
"description": "A simple BNF syntax parser",
"repository": "https://github.com/tree-sitter/tree-sitter-bnf",
"scope": "source.bnf",
"file_types": [
"bnf"
],
"version": "0.1.0",
"license": "MIT",
"author": "Alberto Simões",
"email": "ambs@cpan.org",
"url": "https://ambs.zbr.pt/"
}
✔ Does the config above look correct? · yes
After answering these questions, a lot of files are created in the current folder. Follows the tree of files and some comments on what these folders or files are. I didn’t go into too much detail. Any required information will be given during the tutorial.
.
├── CMakeLists.txt cmake for C bindings
├── Cargo.toml config file for Rust bindings
├── Makefile makefile for C bindings
├── Package.swift config file for SWIFT bindings
├── binding.gyp config file for JS bindings
├── bindings parser bindings, one folder per lang
│ ├── c
│ │ ├── tree-sitter-bnf.pc.in
│ │ └── tree_sitter
│ │ └── tree-sitter-bnf.h
│ ├── go
│ │ ├── binding.go
│ │ └── binding_test.go
│ ├── node
│ │ ├── binding.cc
│ │ ├── binding_test.js
│ │ ├── index.d.ts
│ │ └── index.js
│ ├── python
│ │ ├── tests
│ │ │ └── test_binding.py
│ │ └── tree_sitter_bnf
│ │ ├── __init__.py
│ │ ├── __init__.pyi
│ │ ├── binding.c
│ │ └── py.typed
│ ├── rust
│ │ ├── build.rs
│ │ └── lib.rs
│ └── swift
│ ├── TreeSitterBnf
│ │ └── bnf.h
│ └── TreeSitterBnfTests
│ └── TreeSitterBnfTests.swift
├── go.mod config file for GO bindings
├── grammar.js our BNF grammar
├── package.json tree-sitter grammar package file
├── pyproject.toml config file for Python bindings
├── setup.py setup file for Python bindings
└── tree-sitter.json tree-sitter grammar config
To reduce clutter, I removed the bindings for the following languages: Python, Swift, Node (JavaScript) and Go:
rm -fr bindings/swift/
rm -fr bindings/python/
rm -fr bindings/go/
rm -fr bindings/node/
BashI also removed the config files for these languages:
rm -fr binding.gyp go.mod setup.py Package.swift pyproject.toml
BashFinally, I edited the tree-sitter.json
file to disable the bindings for those languages:
"bindings": {
"c": true,
"go": false,
"node": false,
"python": false,
"rust": true,
"swift": false,
"zig": false
}
Writing the Parser
Before getting into the grammar details, let us generate the parser library. The default parser just recognizes the ‘hello’ string. Not very interesting. But this will allow us to see what files are generated by the CLI tool, and understand further how tree-sitter works.
tree-sitter generate
BashA couple more files were generated locally under the src
folder:
├── [...]
├── src
│ ├── grammar.json
│ ├── node-types.json
│ ├── parser.c
│ └── tree_sitter
│ ├── alloc.h
│ ├── array.h
│ └── parser.h
└── [...]
These are the C files required to compile the parser library, and other configuration files used by tree-sitter. How these files are used in an application is something we will look into in the next part of this tutorial. For now, we just need to know there is C code being generated under the hood.
As said before, our parser will recognize a very simple syntax for BNF grammars. So simple that it can’t describe itself. The BNF grammar is a set of rules. Each rule has a left side, that is a non-terminal starting with an uppercase letter. Follows an arrow and the body of the rule, which ends with a semicolon. The body of a rule can include optional branches (separated by the pipe symbol). Each branch is a sequence of symbols (terminal and non-terminal). Finally non-terminal symbols start with a lowercase.
Grammar -> Rule+
Rule -> NonTerminal '->' RuleBody ';'
RuleBody -> SymbolSeq ('|' SymbolSeq)*
SymbolSeq -> Symbol+
Symbol -> Terminal | NonTerminal
Terminal -> /[a-z_][a-zA-Z0-9]*/
NonTerminal -> /[A-Z][a-zA-Z0-9]*/
PlaintextA tree-sitter grammar is just a program in JavaScript, using some specific constructs to specify the grammar symbols. For now our grammar will be as follows:
module.exports = grammar({
name: "bnf",
rules: {
grammar: $ => repeat1($.rule),
rule: $ => seq($.nonTerminal, '->', $.ruleBody, ';'),
ruleBody: $ => seq($.symbolSeq, repeat(seq('|', $.symbolSeq))),
symbolSeq: $ => repeat1($.symbol),
symbol: $ => choice($.nonTerminal, $.terminal),
terminal: $ => /[a-z_][A-Za-z0-9_]*/,
nonTerminal: $ => /[A-Z][A-Za-z0-9_]*/,
}
});
JavaScriptAs you can see, the syntax is a little confusing at first. But we have a rule for each rule of the BNF syntax above, with the rule name at the left, and some kind of function invocation at the right. Each of those functions has a specific meaning and they work just like the BNF operators.
The repeat1
function is equivalent to the Keene Closure, with at least one occurrence (thus, a sequence of one or more occurrences of its content). Thus, a grammar
will be a sequence of one or more rule
.
The seq
represents a sequence. So, a rule
is a sequence of four symbols: a non-terminal symbol, the arrow, the rule body and a semicolon.
The ruleBody
is just a mix of using the functions described above, specifying that the body might contain just a sequence of symbols (symbolSeq
) or more than one, separated by a pipe symbol. The repeat
, unline repeat1
, do not require the minimum occurrence of one.
The symbolSeq
is, again, a sequence with at least one item, in this case, a symbol
.
The symbol
might be terminal or non-terminal, thus using the choice
function.
Finally, the two terminal symbols are described directly as regular expressions.
After saving the file, we can generate the parser again with tree-sitter generate
.
Testing the Parser
To test the parser, we will declare our BNF grammar using a less compact notation, incorporating recursion, as our parser does not recognise all the notation we used before.
Create the file sample.bnf
with the following content:
Grammar -> Rule | Grammar Rule;
Rule -> nonTerminal arrow RuleBody semicolon;
RuleBody -> SymbolSeq | RuleBody pipe SymbolSeq;
SymbolSeq -> Symbol | SymbolSeq Symbol;
Symbol -> terminal | nonTerminal;
PlaintextTo test that the parser is able to recognize this text, we will use the following command
tree-sitter parse < sample.bnf
BashTree-Sitter will output the parse tree (that is not that easy to understand, I know). But with some care, you can get hold of it. As an example, note that there are five rules inside the grammar. Another interesting point, note that there are no nodes for the literal nodes. For each node, you also get the line and column of the beginning and the end of the span.
If you like, you can change the grammar, including an error. Running again the command above you will see one or more ERROR nodes.
This tutorial will continue in the next days. Please comment on what you would like to read here, that I missed, or what you would like to be the next steps. I’ll try to do my best.