Tree-Sitter Tutorial – Part #6

As stated previously, Part #5 of this tutorial provides a brief overview of using Tree-Sitter from C#. This post continues from Part #4 and will add a Domain Object Model to the parser. It will not include many details specific to Tree-Sitter, as it will be more of a restructuring of our Rust code. Nevertheless, hope it gets handy both for people learning Tree-Sitter and/or Rust.

The test case

The test case used for this post is the same as before, the grammar for our BNF parser.

bnf-grammar.bnf
grammar -> rule+ ;
rule -> nonTerminal '->' ruleBody ';' ;
ruleBody -> symbolSeq ( '|' symbolSeq )* ;
symbolSeq -> symbol+ ;
symbol -> nonTerminal | _terminal ;
_terminal -> pattern | literal ;
pattern -> /\/([^/]|\\\/)+\// ;
literal -> /'([^']|\\')+'/ ;
nonTerminal -> /[A-Za-z_][A-Za-z0-9_]*/ ;
Plaintext

Save this file, as it will be useful for parsing our code.

Fixing the parser

One thing you might notice is that our previous parser cannot parse this grammar correctly. That is because of our regular expression to detect patterns (oh, the irony!). Let us fix that:

grammar.js
  pattern: $ => /\/([^/]|\[[^]]+\]|\\\/)+\//,
JavaScript

We were not allowing the use of the bare / in line 7 negation pattern: [^/]. If you see Part #4 of this tutorial, you will notice we had the pattern defined as /\/([^\/]|\\/)+\//.

Do not forget to run tree-sitter generate to recreate the parser.

Understanding if there are parsing errors

We expected that Tree-Sitter would not return a parsing tree in case of a parsing error. That’s not true. Tree-Sitter can recover and return a parsing tree with error nodes. We’re not dealing with that kind of node yet, so we can rewrite the main function to check for errors and report them.

src/main.rs
   // Call the parser
   let tree = parser.parse(&source_code, None)
                    .expect(&format!("Error parsing source code: {}", filename));
   let root_node = tree.root_node();
   if root_node.has_error() {
       panic!("There was some parsing error")
   }
Rust

The has_error method lets you know if a node or its child node is of the “Error” type. Parsing trees with error nodes can be easily processed, to allow an application to be robust enough to deal with possible errors.

The Data structures

We will create three data structures: one for the Grammar, one for a Production, and another for a Production Node. I will add them to a file named dom.rs. The Grammar is a list of productions:

src/dom.rs
pub struct Grammar {
    pub productions: Vec<Production>
}
Rust

If you are learning Rust, the pub keyword makes the structure and the access to the field public to external crates.

The Production structure is quite simple as well:

src/dom.rs
pub struct Production {
    pub name: String,
    pub body: GrammarNode
}
Rust

The name is the left-hand side (LHS) of the production. The body will be the right-hand side (RHS) of the production. For the name, we will use a standard string. For the body, we will define a new data structure.

src/dom.rs
pub enum GrammarNode {
    Sequence(Vec<GrammarNode>),
    Choice(Vec<GrammarNode>),
    TerminalLiteral(String),
    TerminalPattern(String),
    NonTerminal(String),
    ZeroOrMore(Box<GrammarNode>),
    OneOrMore(Box<GrammarNode>),
}
Rust

Here, we take advantage of the flexibility of the enumeration data types. Each possible value of an enumeration can have a custom value. We can use this to create a simulation of an object-oriented hierarchy. A node can be any of the specified types and will store the relevant data.

The main note I want to stress here is the usage of the Box object. A Box<T> is a smart pointer that allocates data on the heap instead of the stack, which enables recursive or large data structures.

Data Structures Display

Given that we will create a structure to store the grammar, we will also need to implement a way to print it using the Tree-Sitter syntax, just as we were doing before. For that, we will implement the Display trait from the fmt crate.

First, add the relevant imports:

src/dom.rs
use std::fmt;
use std::fmt::{Display, Formatter};
Rust

The display method for the Grammar object will be implemented this way:

src/dom.rs
impl Display for Grammar {
    fn fmt(&self, fmt: &mut Formatter<'_>) -> fmt::Result {
        for production in &self.productions {
            write!(fmt, "{}\n", production)?;
        }
        Ok(())
    }
}
Rust

The impl block lets us specify how to implement methods for a specific object type. The impl .. for syntax allows the trait implementation for a specific object type. The fmt method is called whenever the object is printed. In this case, we cycle through the productions and print each one of them. We also use the ? syntax, that allows us to propagate the Result object returned by write!. Please read the Rust documentation for further details.

The Display implementation for the Production structure is quite straightforward as well:

src/dom.rs
impl Display for Production {
    fn fmt(&self, fmt: &mut Formatter<'_>) -> fmt::Result {
        fmt.write_str(&self.name)?;
        fmt.write_str(" -> ")?;
        write!(fmt, "{}", &self.body)?;
        Ok(())
    }
}
Rust

Here we use the write_str method for strings, and keep using the write! macro for messages using formats.

src/dom.rs
impl Display for GrammarNode {
    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        match self {
            GrammarNode::Sequence(s) => {
                write!(f, "seq({})", 
                    s.iter().map(|x| x.to_string()).collect::<Vec<_>>().join(", "))
            }
            GrammarNode::Choice(c) => {
                write!(f, "choice({})", 
                    c.iter().map(|x| x.to_string()).collect::<Vec<_>>().join(", "))
            }
            GrammarNode::TerminalLiteral(l) => { write!(f, "{}", l)}
            GrammarNode::TerminalPattern(p) => { write!(f, "{}", p)}
            GrammarNode::NonTerminal(nt) => { write!(f, "$.{}", nt)}
            GrammarNode::ZeroOrMore(zm) => {write!(f, "repeat({})", zm)}
            GrammarNode::OneOrMore(om) => {write!(f, "repeat1({})", om)}
        }
    }
}
Rust

The Display method for the GrammarNode type dispatches the format accordingly with the type, using the match statement. Each rule prints the element accordingly with the desired output. The less clear lines are lines 47 and 51, as we run a map to the to_string method, whose result is then joined by a comma. The iter creates an iterator, that runs through the map. The map results are collected into a vector, so that they can then be joined by the comma.

The Visitors

Our visitors will be rewritten, and will me moved to their own rust file. Create the visitors.rs file, and start adding code. First, the imports:

src/visitors.rs
use tree_sitter::Node;
use crate::dom::{Grammar, GrammarNode, Production};
use crate::dom::GrammarNode::*;
Rust

For debugging, I created this method, that panics if a tree-sitter node kind is not the expected one. While it should not be the case while running the final project, it can be useful for debugging. Therefore, I am sharing it, in case you need a similar approach.

src/visitors.rs
fn ensure_node_type(node: &Node, node_type: &str) {
    if node.kind() != node_type {
        panic!("Expected node type {} but got {}", node_type, node.kind());
    }
}
Rust

The visitor for the grammar production is just a call for the visitor of the rule, and the storage of the return value of each rule in the Grammar vector:

src/visitors.rs
pub fn visit_grammar(node: &Node<'_>, source_code: &str) -> Grammar {
    ensure_node_type(node, "grammar");
    let mut grammar = Grammar { productions: Vec::new() };

    let count = node.child_count();
    for i in 0..count {
        let child = node.child(i).unwrap();
        let production = visit_rule(&child, &source_code);
        grammar.productions.push(production);
    }
    return grammar;
}
Rust
  • Line 11: We start defining the return type of the visitor. Up now, our visitors did not return any value. Now, we are constructing a data structure, and therefore, we need to return the parts of that structure, so that they can be glued together.
  • Line 12: Call the auxiliary method to ensure the received node has the right kind.
  • Line 13: Construct the grammar object that will be returned by the visitor.
  • Line 18: Call the visit_rule method, that will return a Production.
  • Line 19: Add the obtained production in the grammar vector of productions.
  • Line 21: Do not forget to return the constructed object.

Note that this is the only public visitors. All the other ones are called inside this module, so not needing the pub modifier.

src/visitors.rs
fn visit_rule(node: &Node<'_>, source_code: &str) -> Production {
    ensure_node_type(node, "rule");

    let rule_name = node.child(0).unwrap();
    let rule_body = node.child(2).unwrap();

    let NonTerminal(name) = visit(&rule_name, &source_code) else {
        panic!("Non Non-Terminal on LHS of production")
    };
    let body = visit(&rule_body, &source_code);

    return Production { name, body };
}
Rust

The visitor for the rule production has the following details:

  • Line 24: specify the correct return type for the visitor.
  • Line 25: we kept here the sanity test to ensure the node has the right kind.
  • Line 27 and 28: As before, we get the nodes for the rule name and body.
  • Line 30: call the generic visit method on the rule name. We know the rule name is always a non terminal. But Rust is not able to confirm that, so we use the let ... else construct to deconstruct the non-terminal, and panic in case that is not possible.
  • Line 33: call the visitor for the rule body.
  • Line 35: return the new production object.
src/visitors.rs
fn visit_non_terminal(node: &Node<'_>, source_code: &str) -> GrammarNode {
    let text = node.utf8_text(source_code.as_bytes()).unwrap();
    return NonTerminal(format!("{}", text));
}

fn visit_pattern(node: &Node<'_>, source_code: &str) -> GrammarNode {
    let text = node.utf8_text(source_code.as_bytes()).unwrap();
    return TerminalPattern(format!("{}", text));
}

fn visit_literal(node: &Node<'_>, source_code: &str) -> GrammarNode {
    let text = node.utf8_text(source_code.as_bytes()).unwrap();
    return TerminalLiteral(format!("{}", text));
}
Rust

These three visitors are very similar. We just be sure to define they return a grammar node, and create the right object, accordingly with the parser node kind.

The ruleBody production visitor is similar to what we had before, but instead of printing, we store the return values of each visit call:

src/visitors.rs
fn visit_rule_body(node: &Node<'_>, source_code: &str) -> GrammarNode {
    let count = node.child_count();
    if count == 1 {
        let child = node.child(0).unwrap();
        return visit(&child, &source_code);
    } else {
        let mut choice = Vec::new();
        let mut i = 0;
        while i < count {
            choice.push(visit(&node.child(i).unwrap(), &source_code));
            i += 2;
        }
        return Choice(choice);
    }
}
Rust

The sequence visitor is mostly the same, creating a Sequence object, instead of a Choice.

src/visitors.rs
fn visit_symbol_seq(node: &Node<'_>, source_code: &str) -> GrammarNode {
    let count = node.child_count();
    if count == 1 {
        let child = node.child(0).unwrap();
        return visit(&child, &source_code);
    } else {
        let mut seq = Vec::new();
        for i in 0..count {
            seq.push(visit(&node.child(i).unwrap(), &source_code));
        }
        return Sequence(seq);
    }
}
Rust

The following visitor is kept unmodified, except for the return type:

src/visitors.rs
fn visit_symbol_subseq(node: &Node<'_>, source_code: &str) -> GrammarNode {
    return visit(&node.child(1).unwrap(), &source_code);
}
Rust

Finally, the symbol visitor has a few more changes:

src/visitors.rs
fn visit_symbol(node: &Node<'_>, source_code: &str) -> GrammarNode {
    let child = node.child(0).unwrap();
    let symbol = visit(&child, &source_code);

    let mut kleene = "";
    if node.child_count() > 1 {
        kleene = node.child(1).unwrap().kind();
    }

    return match kleene {
        "plus" => OneOrMore(Box::new(symbol)),
        "asterisk" => ZeroOrMore(Box::new(symbol)),
        _ => symbol
    }
}
Rust

As before, we visit the symbol child. Then, in case we have a modifier we get it. Then, the modifier is used to decide if we can return the symbol directly or if we need to create a repeating object. If we need to create such an object, we need to wrap the symbol in a Box<T> object, as explained above.

The main visitor dispatcher is mostly unmodified as well:

src/visitors.rs
fn visit(node: &Node<'_>, source_code: &str) -> GrammarNode {
    let kind = node.kind();

    match kind {
        "nonTerminal" => visit_non_terminal(&node, &source_code),
        "ruleBody"    => visit_rule_body(&node, &source_code),
        "symbolSeq"   => visit_symbol_seq(&node, &source_code),
        "symbol"      => visit_symbol(&node, &source_code),
        "pattern"     => visit_pattern(&node, &source_code),
        "literal"     => visit_literal(&node, &source_code),
        "subSeq"      => visit_symbol_subseq(&node, &source_code),
        _             => panic!("Unknown node kind: {}", kind)
    }
}
Rust

The main

Regarding main.rs we need to add the reference to external modules:

src/main.rs
mod dom;
mod visitors;

use std::env;
use std::fs::File;
use std::io::Read;

use crate::visitors::visit_grammar;
use tree_sitter::Parser;
Rust

And add the right call to the grammar visitor at the end:

Rust
    // Call the parser
    let tree = parser.parse(&source_code, None)
                     .expect(&format!("Error parsing source code: {}", filename));

    let root_node = tree.root_node();
    if root_node.has_error() {
        panic!("There was some parsing error")
    }
    let grammar = visit_grammar(&root_node, &source_code);
    println!("{}", grammar);
}
Rust

You can now compile your code and run it against the bnf-grammar.bnf: cargo run bnf-grammar.bnf

The code for the tree-sitter grammar and for the Rust app are both available here.

Leave a Reply