As stated previously, Part #5 of this tutorial provides a brief overview of using Tree-Sitter from C#. This post continues from Part #4 and will add a Domain Object Model to the parser. It will not include many details specific to Tree-Sitter, as it will be more of a restructuring of our Rust code. Nevertheless, hope it gets handy both for people learning Tree-Sitter and/or Rust.
The test case
The test case used for this post is the same as before, the grammar for our BNF parser.
grammar -> rule+ ;
rule -> nonTerminal '->' ruleBody ';' ;
ruleBody -> symbolSeq ( '|' symbolSeq )* ;
symbolSeq -> symbol+ ;
symbol -> nonTerminal | _terminal ;
_terminal -> pattern | literal ;
pattern -> /\/([^/]|\\\/)+\// ;
literal -> /'([^']|\\')+'/ ;
nonTerminal -> /[A-Za-z_][A-Za-z0-9_]*/ ;
PlaintextSave this file, as it will be useful for parsing our code.
Fixing the parser
One thing you might notice is that our previous parser cannot parse this grammar correctly. That is because of our regular expression to detect patterns (oh, the irony!). Let us fix that:
pattern: $ => /\/([^/]|\[[^]]+\]|\\\/)+\//,
JavaScriptWe were not allowing the use of the bare /
in line 7 negation pattern: [^/]
. If you see Part #4 of this tutorial, you will notice we had the pattern defined as /\/([^\/]|\\/)+\//
.
Do not forget to run tree-sitter generate
to recreate the parser.
Understanding if there are parsing errors
We expected that Tree-Sitter would not return a parsing tree in case of a parsing error. That’s not true. Tree-Sitter can recover and return a parsing tree with error nodes. We’re not dealing with that kind of node yet, so we can rewrite the main function to check for errors and report them.
// Call the parser
let tree = parser.parse(&source_code, None)
.expect(&format!("Error parsing source code: {}", filename));
let root_node = tree.root_node();
if root_node.has_error() {
panic!("There was some parsing error")
}
RustThe has_error
method lets you know if a node or its child node is of the “Error” type. Parsing trees with error nodes can be easily processed, to allow an application to be robust enough to deal with possible errors.
The Data structures
We will create three data structures: one for the Grammar, one for a Production, and another for a Production Node. I will add them to a file named dom.rs
. The Grammar is a list of productions:
pub struct Grammar {
pub productions: Vec<Production>
}
RustIf you are learning Rust, the pub
keyword makes the structure and the access to the field public to external crates.
The Production structure is quite simple as well:
pub struct Production {
pub name: String,
pub body: GrammarNode
}
RustThe name
is the left-hand side (LHS) of the production. The body
will be the right-hand side (RHS) of the production. For the name, we will use a standard string. For the body, we will define a new data structure.
pub enum GrammarNode {
Sequence(Vec<GrammarNode>),
Choice(Vec<GrammarNode>),
TerminalLiteral(String),
TerminalPattern(String),
NonTerminal(String),
ZeroOrMore(Box<GrammarNode>),
OneOrMore(Box<GrammarNode>),
}
RustHere, we take advantage of the flexibility of the enumeration data types. Each possible value of an enumeration can have a custom value. We can use this to create a simulation of an object-oriented hierarchy. A node can be any of the specified types and will store the relevant data.
The main note I want to stress here is the usage of the Box
object. A Box<T>
is a smart pointer that allocates data on the heap instead of the stack, which enables recursive or large data structures.
Data Structures Display
Given that we will create a structure to store the grammar, we will also need to implement a way to print it using the Tree-Sitter syntax, just as we were doing before. For that, we will implement the Display
trait from the fmt
crate.
First, add the relevant imports:
use std::fmt;
use std::fmt::{Display, Formatter};
RustThe display method for the Grammar object will be implemented this way:
impl Display for Grammar {
fn fmt(&self, fmt: &mut Formatter<'_>) -> fmt::Result {
for production in &self.productions {
write!(fmt, "{}\n", production)?;
}
Ok(())
}
}
RustThe impl
block lets us specify how to implement methods for a specific object type. The impl .. for
syntax allows the trait implementation for a specific object type. The fmt
method is called whenever the object is printed. In this case, we cycle through the productions and print each one of them. We also use the ?
syntax, that allows us to propagate the Result
object returned by write!
. Please read the Rust documentation for further details.
The Display implementation for the Production structure is quite straightforward as well:
impl Display for Production {
fn fmt(&self, fmt: &mut Formatter<'_>) -> fmt::Result {
fmt.write_str(&self.name)?;
fmt.write_str(" -> ")?;
write!(fmt, "{}", &self.body)?;
Ok(())
}
}
RustHere we use the write_str
method for strings, and keep using the write!
macro for messages using formats.
impl Display for GrammarNode {
fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
match self {
GrammarNode::Sequence(s) => {
write!(f, "seq({})",
s.iter().map(|x| x.to_string()).collect::<Vec<_>>().join(", "))
}
GrammarNode::Choice(c) => {
write!(f, "choice({})",
c.iter().map(|x| x.to_string()).collect::<Vec<_>>().join(", "))
}
GrammarNode::TerminalLiteral(l) => { write!(f, "{}", l)}
GrammarNode::TerminalPattern(p) => { write!(f, "{}", p)}
GrammarNode::NonTerminal(nt) => { write!(f, "$.{}", nt)}
GrammarNode::ZeroOrMore(zm) => {write!(f, "repeat({})", zm)}
GrammarNode::OneOrMore(om) => {write!(f, "repeat1({})", om)}
}
}
}
RustThe Display method for the GrammarNode
type dispatches the format accordingly with the type, using the match
statement. Each rule prints the element accordingly with the desired output. The less clear lines are lines 47 and 51, as we run a map to the to_string
method, whose result is then joined by a comma. The iter
creates an iterator, that runs through the map
. The map
results are collect
ed into a vector, so that they can then be joined by the comma.
The Visitors
Our visitors will be rewritten, and will me moved to their own rust file. Create the visitors.rs
file, and start adding code. First, the imports:
use tree_sitter::Node;
use crate::dom::{Grammar, GrammarNode, Production};
use crate::dom::GrammarNode::*;
RustFor debugging, I created this method, that panics if a tree-sitter node kind is not the expected one. While it should not be the case while running the final project, it can be useful for debugging. Therefore, I am sharing it, in case you need a similar approach.
fn ensure_node_type(node: &Node, node_type: &str) {
if node.kind() != node_type {
panic!("Expected node type {} but got {}", node_type, node.kind());
}
}
RustThe visitor for the grammar production is just a call for the visitor of the rule, and the storage of the return value of each rule in the Grammar vector:
pub fn visit_grammar(node: &Node<'_>, source_code: &str) -> Grammar {
ensure_node_type(node, "grammar");
let mut grammar = Grammar { productions: Vec::new() };
let count = node.child_count();
for i in 0..count {
let child = node.child(i).unwrap();
let production = visit_rule(&child, &source_code);
grammar.productions.push(production);
}
return grammar;
}
Rust- Line 11: We start defining the return type of the visitor. Up now, our visitors did not return any value. Now, we are constructing a data structure, and therefore, we need to return the parts of that structure, so that they can be glued together.
- Line 12: Call the auxiliary method to ensure the received node has the right kind.
- Line 13: Construct the grammar object that will be returned by the visitor.
- Line 18: Call the
visit_rule
method, that will return aProduction
. - Line 19: Add the obtained production in the grammar vector of productions.
- Line 21: Do not forget to return the constructed object.
Note that this is the only public visitors. All the other ones are called inside this module, so not needing the pub
modifier.
fn visit_rule(node: &Node<'_>, source_code: &str) -> Production {
ensure_node_type(node, "rule");
let rule_name = node.child(0).unwrap();
let rule_body = node.child(2).unwrap();
let NonTerminal(name) = visit(&rule_name, &source_code) else {
panic!("Non Non-Terminal on LHS of production")
};
let body = visit(&rule_body, &source_code);
return Production { name, body };
}
RustThe visitor for the rule
production has the following details:
- Line 24: specify the correct return type for the visitor.
- Line 25: we kept here the sanity test to ensure the node has the right kind.
- Line 27 and 28: As before, we get the nodes for the rule name and body.
- Line 30: call the generic
visit
method on the rule name. We know the rule name is always a non terminal. But Rust is not able to confirm that, so we use thelet ... else
construct to deconstruct the non-terminal, and panic in case that is not possible. - Line 33: call the visitor for the rule body.
- Line 35: return the new production object.
fn visit_non_terminal(node: &Node<'_>, source_code: &str) -> GrammarNode {
let text = node.utf8_text(source_code.as_bytes()).unwrap();
return NonTerminal(format!("{}", text));
}
fn visit_pattern(node: &Node<'_>, source_code: &str) -> GrammarNode {
let text = node.utf8_text(source_code.as_bytes()).unwrap();
return TerminalPattern(format!("{}", text));
}
fn visit_literal(node: &Node<'_>, source_code: &str) -> GrammarNode {
let text = node.utf8_text(source_code.as_bytes()).unwrap();
return TerminalLiteral(format!("{}", text));
}
RustThese three visitors are very similar. We just be sure to define they return a grammar node, and create the right object, accordingly with the parser node kind.
The ruleBody production visitor is similar to what we had before, but instead of printing, we store the return values of each visit call:
fn visit_rule_body(node: &Node<'_>, source_code: &str) -> GrammarNode {
let count = node.child_count();
if count == 1 {
let child = node.child(0).unwrap();
return visit(&child, &source_code);
} else {
let mut choice = Vec::new();
let mut i = 0;
while i < count {
choice.push(visit(&node.child(i).unwrap(), &source_code));
i += 2;
}
return Choice(choice);
}
}
RustThe sequence visitor is mostly the same, creating a Sequence
object, instead of a Choice
.
fn visit_symbol_seq(node: &Node<'_>, source_code: &str) -> GrammarNode {
let count = node.child_count();
if count == 1 {
let child = node.child(0).unwrap();
return visit(&child, &source_code);
} else {
let mut seq = Vec::new();
for i in 0..count {
seq.push(visit(&node.child(i).unwrap(), &source_code));
}
return Sequence(seq);
}
}
RustThe following visitor is kept unmodified, except for the return type:
fn visit_symbol_subseq(node: &Node<'_>, source_code: &str) -> GrammarNode {
return visit(&node.child(1).unwrap(), &source_code);
}
RustFinally, the symbol
visitor has a few more changes:
fn visit_symbol(node: &Node<'_>, source_code: &str) -> GrammarNode {
let child = node.child(0).unwrap();
let symbol = visit(&child, &source_code);
let mut kleene = "";
if node.child_count() > 1 {
kleene = node.child(1).unwrap().kind();
}
return match kleene {
"plus" => OneOrMore(Box::new(symbol)),
"asterisk" => ZeroOrMore(Box::new(symbol)),
_ => symbol
}
}
RustAs before, we visit the symbol child. Then, in case we have a modifier we get it. Then, the modifier is used to decide if we can return the symbol directly or if we need to create a repeating object. If we need to create such an object, we need to wrap the symbol in a Box<T>
object, as explained above.
The main visitor dispatcher is mostly unmodified as well:
fn visit(node: &Node<'_>, source_code: &str) -> GrammarNode {
let kind = node.kind();
match kind {
"nonTerminal" => visit_non_terminal(&node, &source_code),
"ruleBody" => visit_rule_body(&node, &source_code),
"symbolSeq" => visit_symbol_seq(&node, &source_code),
"symbol" => visit_symbol(&node, &source_code),
"pattern" => visit_pattern(&node, &source_code),
"literal" => visit_literal(&node, &source_code),
"subSeq" => visit_symbol_subseq(&node, &source_code),
_ => panic!("Unknown node kind: {}", kind)
}
}
RustThe main
Regarding main.rs
we need to add the reference to external modules:
mod dom;
mod visitors;
use std::env;
use std::fs::File;
use std::io::Read;
use crate::visitors::visit_grammar;
use tree_sitter::Parser;
RustAnd add the right call to the grammar visitor at the end:
// Call the parser
let tree = parser.parse(&source_code, None)
.expect(&format!("Error parsing source code: {}", filename));
let root_node = tree.root_node();
if root_node.has_error() {
panic!("There was some parsing error")
}
let grammar = visit_grammar(&root_node, &source_code);
println!("{}", grammar);
}
RustYou can now compile your code and run it against the bnf-grammar.bnf
: cargo run bnf-grammar.b
nf
The code for the tree-sitter grammar and for the Rust app are both available here.