Tree-Sitter Tutorial – Part #3

C is a nice programming language, I enjoy playing with. But we need to accept that, unless we use libraries, writing code from scratch takes a lot of time. With that in mind, and given that I am also trying to learn Rust, I decided to write this post, very similar to the second part of this tutorial, but now, with Rust as the target programming language. This will allow any other new parts (if I end up writing any) to use Rust data structures to make the code simpler.

Thus, let’s dive into the awesome Rust and Tree-Sitter world.

Get into the main folder for our BNF parser. Remember there is a Cargo.toml file there, created by the bootstrap process (part #1 of this tutorial). This file is ready to build a Rust Crate for the BNF parser library. Then, it is just a matter of creating our tool source code and using that crate as a dependency.

Hopefully, you have Rust installed. If not, check their webpage, and understand what the best approach to you. As soon as you have it ready, run cargo build. The result will be some new folders. For example, a target folder and a Cargo.lock. I will not dig into the details of why these files are created.

To be able to package the file, we need a read-me file. I just created a mostly empty one using the following command:

echo '# BNF Parser Library' > README.md
Bash

Now, type cargo package. A new folder inside the target folder is created, with the name target/package. It contains your shiny new crate.

The next step is to create our application. Get out of the tree-sitter-bnf folder, and create a new folder for our bnf2ts application, now written in Rust. Get inside that folder and type cargo init. It will bootstrap the structure for a Rust application. It will create a Cargo.toml for your application and a src/main.rs, which is your new Rust application.

mkdir bnf2ts
cargo init
Bash

We will start by adding the dependency to the tree sitter library and to our shiny new crate we just created above. Edit the Cargo.toml file and add this content:

Cargo.toml
[package]
name = "bnf2ts"
version = "0.1.0"
edition = "2025"

[dependencies]
tree-sitter="0.25.3"

[dependencies.tree-sitter-bnf]
path = "../tree-sitter-bnf"
TOML

The dependency on the tree sitter library is added normally, but the dependency on our Cargo file is set with a relative path. This way we can test it without publishing the library officially.

Parsing a File

Just like in the previous part of this tutorial, we will first create the basic code to read the BNF file and parse it, and then, add the code to visit it, outputting a tree-sitter grammar.

src/main.rs
use std::env;
use std::fs::File;
use std::io::Read;

use tree_sitter::{Node, Parser};

fn main() {
    // Get command-line arguments
    let args: Vec<String> = env::args().collect();

    // Check if a filename was provided
    if args.len() != 2 {
        eprintln!("Usage: {} <filename>", args[0]);
        std::process::exit(1);
    }

    // Get the filename from the first argument (args[0] is the program name)
    let filename = &args[1];

    // Open the file
    let mut file = File::open(filename)
              .expect(&format!("Error opening file: {}", filename));

    // Create a string to store the file contents
    let mut source_code = String::new();

    // Read the file contents into the string
    file.read_to_string(&mut source_code)
              .expect(&format!("Error reading file: {}", filename));
        
    let mut parser : Parser = Parser::new();
    // Set the language to BNF
    parser.set_language(&tree_sitter_bnf::LANGUAGE.into())
              .expect("Error loading BNF grammar");
            
    // Call the parser
    let tree = parser.parse(&source_code, None)
              .expect(&format!("Error parsing source code: {}", filename))
        
    let root_node = tree.root_node();
}
Rust

If you know Rust, this should be easy to read. If not, I will try to give a quick overview:

  • Line 9: Get the arguments of the command line into a vector of strings. As env::args() returns an iterator we use collect() to collect all the items from the iterator into a list.
  • Lines 12 to 15: Check the number of arguments. Just like in C, the program name is the first one. Thus, we need two.
  • Line 18: Get the filename provided in the command line.
  • Lines 21 and 22: Open the file and complain if it fails. expect unwraps the Result object returned by the open() call. If the Result is Ok, its value is returned. If the Result is Err, the error message is printed and the program panics.
  • Line 25: Create an empty string where to load the source code.
  • Lines 28 and 29: Read the contents of the source code into the string. Again, panic if any error occurs.
  • Line 31: Create a new tree-sitter parser object.
  • Lines 33 and 34: Set the language of the tree-sitter parser to BNF. The idiom here is dense. Basically, into() is a conversion tool, that fits the type returned by the LANGUAGE field into the type required by the set_language method.
  • Lines 37 and 38: Call the tree-sitter parser. Just like in C, we pass the source code and a previous parse tree, in case we have one. For Rust we do not need to string size, as the Rust object stores that information.
  • Line 40: Fetch the root node from the parsing tree.

Time to compile it and run. You can test with the following line:

crate run sample.bnf
Bash

If no output is printed, everything looks good.

Visiting the Parse Tree

As stated previously, I will try to follow the same idea of the C implementation, but I will replace the big if/else conditional block with a match switch case, and will create a function to handle each grammar type.

Add the following line at the end of your main function:

src/main.rs
    visit(&root_node, &source_code);
Rust

The visit function will dispatch the node processing accordingly with its type. In the Rust bindings, the node type is named as Kind:

src/main.rs
fn visit(node: &Node<'_>, source_code: &str) {
    let kind = node.kind();

    match kind {
        "grammar" => visit_grammar(&node, &source_code),
        "rule" => visit_rule(&node, &source_code),
        "nonTerminal" => visit_non_terminal(&node, &source_code),
        "terminal" => visit_terminal(&node, &source_code),
        "ruleBody" => visit_rule_body(&node, &source_code),
        "symbolSeq" => visit_symbol_seq(&node, &source_code),
        "symbol" => visit_symbol(&node, &source_code),
        _ => println!("Node kind: {}", kind)
    }
}
Rust

First, a peek into the function signature. We receive a Node reference. The <'_> part means that this node has references to other variables that need to live as long as the node (that is, the garbage collector should not rip them off while the Node exists). The usage of this specific syntax means that we are letting the compiler to infer the lifetime (thus the _ instead of a specific lifetime).

Line 46 gets the node kind, and then we switch over it. All branches to the same: call a specific visit function for the node type.

Just line in the C code, line 56 is used for debug, in case a new kind of node appears in town.

src/main.rs
fn visit_grammar(node: &Node<'_>, source_code: &str) {
    println!("rules: {{");
    let count = node.child_count();
    for i in 0..count {
        let child = node.child(i).unwrap();
        visit(&child, &source_code);
    }
    println!("}}");
}
Rust

This code is straightforward, as it just mimics the C code. Note that instead of calling the visit_rule, we are invoking the visit function, allowing it to dispatch the execution. This allows a little more flexibility if we need to rewrite the grammar. Note that the child function returns an Option<Node>, as you can specify an index out of bounds, and an error needs to be raised. Here we call unwrap blindly to get rid of the Option and get access to the node.

src/main.rs
fn visit_rule(node: &Node<'_>, source_code: &str) {
    let rule_name = node.child(0).unwrap();
    let rule_body = node.child(2).unwrap();

    visit(&rule_name, &source_code);
    print!(": $ => ");
    visit(&rule_body, &source_code);
    println!(",");
}
Rust

We have four children for the rule grammar production. The first is the rule name, the second the arrow separator, the third is the rule body and the fourth is the ending semicolon. As before, we call the visit function instead of the specific one.

src/main.rs
fn visit_non_terminal(node: &Node<'_>, source_code: &str) {
    let start_byte = node.start_byte();
    let end_byte = node.end_byte();
    let text = &source_code[start_byte..end_byte];
    print!("{}", text);
}

fn visit_terminal(node: &Node<'_>, source_code: &str) {
    let start_byte = node.start_byte();
    let end_byte = node.end_byte();
    let text = &source_code[start_byte..end_byte];
    print!("{}", text);
}
Rust

For visiting the terminal and non-terminal symbols, I decided to split into two different functions. They do the same for now. As in the C version, we need to get the start and end bytes of the string to get the matching part. This is not entirely true, as the Rust bindings have a shortcut for that. But this is the same algorithm as the C counterpart, and allows better understanding of what is going on.

src/main.rs
fn visit_rule_body(node: &Node<'_>, source_code: &str) {
    let count = node.child_count();
    if count == 1 {
        let child = node.child(0).unwrap();
        visit(&child, &source_code);
    } else {
        print!("choice(");
        let mut i = 0;
        while i < count {
            visit(&node.child(i).unwrap(), &source_code);
            i += 2;
            if i < count {
                print!(", ");
            }
        }
        print!(")")
    }
}
Rust

This block takes care of the ruleBody production. If there is only one child, process it. If not, we have an alternative sequence, then we use the choice keyword.

src/main.rs
fn visit_symbol_seq(node: &Node<'_>, source_code: &str) {
    let count = node.child_count();
    if count == 1 {
        let child = node.child(0).unwrap();
        visit(&child, &source_code);
    } else {
        print!("seq(");
        for i in 0..count {
            visit(&node.child(i).unwrap(), &source_code);
            if i < count - 1 {
                print!(", ");
            }
        }
        print!(")")
    }
}
Rust

The visitor function for the sequence of symbols is similar to the previous one, but this time, without the need to increment two by two, as there is no delimiter between the symbols.

src/main.rs
fn visit_symbol(node: &Node<'_>, source_code: &str) {
    print!("$.");
    visit(&node.child(0).unwrap(), &source_code)
}
Rust

This is the last visitor function, which handles the symbol node. We print the prefix required by tree-sitter syntax, and visit its only child.

Now you can test the application as we did before, and see the nice output in tree-sitter syntax.

Leave a Reply