Tree-Sitter Tutorial – Part #4

Continuing with Rust, in this part of the tutorial we will improve the grammar with support for a few new things. One will be the support of literals and regular expressions for terminal symbols, and the second will be the support for the Kleene closures.

Grammar Literals

The first changes we will perform are to support literals in the grammar, just like this:

sample.bnf

Grammar -> Rule | Grammar Rule;
Rule -> nonTerminal '->' RuleBody ';';
RuleBody -> SymbolSeq | RuleBody '|' SymbolSeq;
SymbolSeq -> Symbol | SymbolSeq Symbol;
Symbol -> terminal | nonTerminal;
terminal -> /[a-z_][A-Za-z0-9_]*/;
nonTerminal -> /[A-Z][A-Za-z0-9_]*/;

Plaintext

The main difference is that we now can write literals, like '->' or ';' directly in the grammar, as well as declaring regular expressions as terminal symbols.

To allow this, we will change the tree-sitter grammar, as follows:

grammar.js

rules: {
    grammar: $ => repeat1($.rule),
    rule: $ => seq($.nonTerminal, '->', $.ruleBody, ';'),
    ruleBody: $ => seq($.symbolSeq, repeat(seq('|', $.symbolSeq))),
    symbolSeq: $ => repeat1($.symbol),
    symbol: $ => choice($.nonTerminal, $.terminal),
    terminal: $ => choice($.pattern, $.literal),
    pattern: $ => /\/([^/]|\\\/)+\//,
    literal: $ => /'([^']|\\')+'/,
    nonTerminal: $ => /[A-Za-z_][A-Za-z0-9_]*/,
}

JavaScript

These are the main changes:

Line 12 now declares the arrow and semicolon directly as string literals;
Line 13 declares the pipe as a string literal;
Line 16 rewrites the terminal symbol as being either a pattern or a literal;
Line 17 declares how to recognize a pattern;
Line 18 declares how to recognize a literal;
Line 19 now allows symbols to start with upper or lowercase.

Special note for Regular Expressions fans. The pattern we use to recognize patterns is unable to recognize itself, as it would require a backslash before the slash in the negated class: /\/([^\/]|\\\/)+\//.

This should be enough. Let us try, and see if it works:

tree-sitter generate
tree-sitter parse sample.bnf

Bash

If everything looks good, that is, you got the output of a tree without any errors, you are ready to change the main rust code to handle this new parser.

First, we update our dispatch function to handle the new productions:

src/main.rs

fn visit(node: &Node<'_>, source_code: &str) {
    let kind = node.kind();
    
    match kind {
        "grammar" => visit_grammar(&node, &source_code),
        "rule" => visit_rule(&node, &source_code),
        "nonTerminal" => visit_non_terminal(&node, &source_code),
        "terminal" => visit_terminal(&node, &source_code),
        "ruleBody" => visit_rule_body(&node, &source_code),
        "symbolSeq" => visit_symbol_seq(&node, &source_code),
        "symbol" => visit_symbol(&node, &source_code),
        "pattern" => visit_pattern(&node, &source_code),
        "literal" => visit_literal(&node, &source_code),
        _ => println!("Node kind: {}", kind)
    }
}

Rust

No news here. Next step is to edit our visit_symbol function. By default, we added the $. prefix to any token. Nevertheless, we need to add that symbol for non-terminals only. Terminals, like literals or regular expressions (patterns), are typed directly:

src/main.rs

fn visit_symbol(node: &Node<'_>, source_code: &str) {
    let child = node.child(0).unwrap();
    if child.kind() == "nonTerminal" {
        print!("$.");
    }
    visit(&child, &source_code);
}

Rust

This visitor shows that we can drive our actions accordingly with the kind of token we have in our children. This way, the prefix is only printed for non-terminals.

src/main.rs

fn visit_pattern(node: &Node<'_>, source_code: &str) {
    let text = node.utf8_text(source_code.as_bytes()).unwrap();
    print!("{}", text);
}

fn visit_literal(node: &Node<'_>, source_code: &str) {
    let text = node.utf8_text(source_code.as_bytes()).unwrap();
    print!("{}", text);
}

fn visit_terminal(node: &Node<'_>, source_code: &str) {
    visit(&node.child(0).unwrap(), &source_code);
}

Rust

The new visitor for the terminal symbol now does nothing else than visiting its child (more on this soon). The visitors for patterns and literals are similar, and in fact, are the same as the visitor for the non-terminal symbol. The only difference is that we take advantage of the method utf8_text that allows us to get the matched text more easily than before (when we looked into the start and end offsets of the matched region).

And that’s all. You can run your rust application: cargo run sample.bnf

Let’s get back to the terminal symbol. Its presence in the grammar is just for readability. Edit the grammar.js file, and change these two lines:

grammar.js

    symbol: $ => choice($.nonTerminal, $._terminal),
    _terminal: $ => choice($.pattern, $.literal),

JavaScript

The main difference is that we prefixed the non-terminal symbol by an underscore. This underscore has a special meaning for tree-sitter: this is a fake node, and it will not be represented in the parse tree. Run the tree-sitter generate tool, and try to parse our sample again. It should work, but output a little different tree. There will be no terminal nodes.

Back to the Rust code, delete this line:

src/main.rs

        "terminal" => visit_terminal(&node, &source_code),

Rust

and delete this function:

src/main.rs

fn visit_terminal(node: &Node<'_>, source_code: &str) {
    visit(&node.child(0).unwrap(), &source_code);
}

Rust

Run your Rust application, and the output should be the same. Using fake nodes is very useful to simplify the parse tree, save some memory, and make the tree-processing easier.

Kleene Closures

The next step is to support the * and + operators. They will allow us to write our BNF grammar directly, as we have it in the tree-sitter syntax:

bnf-grammar.bnf

grammar -> rule+ ;
rule -> nonTerminal '->' ruleBody ';' ;
ruleBody -> symbolSeq ( '|' symbolSeq )* ;
symbolSeq -> symbol+ ;
symbol -> nonTerminal | _terminal ;
_terminal -> pattern | literal ;
pattern -> /\/([^\/]|\\\/)+\// ;
literal -> /'([^']|\\')+'/ ;
nonTerminal -> /[A-Za-z_][A-Za-z0-9_]*/ ;

Plaintext

To make this possible we need to allow symbols to have a qualifier, and allow symbols to be grouped together with parenthesis. Let’s write that down in tree-sitter. The new tree-sitter grammar looks as follows:

tree-sitter-bnf/grammar.js

rules: {
    grammar: $ => repeat1($.rule),
    rule: $ => seq($.nonTerminal, '->', $.ruleBody, ';'),
    ruleBody: $ => seq($.symbolSeq, repeat(seq('|', $.symbolSeq))),
    symbolSeq: $ => repeat1($.symbol),
    symbol: $ => seq(choice($.nonTerminal, $._terminal, $.subSeq), optional($._kleeneOp)),
    _kleeneOp: $ => choice($.plus, $.asterisk),
    plus: $ => '+',
    asterisk: $ => '*',
    subSeq: $ => seq('(', $.ruleBody, ')'),
    _terminal: $ => choice($.pattern, $.literal),
    pattern: $ => /\/([^/]|\\\/)+\//,
    literal: $ => /'([^']|\\')+'/,
    nonTerminal: $ => /[A-Za-z_][A-Za-z0-9_]*/,
}

JavaScript

The grammar starts to get hairier. But things are not that complicated. The changes, from the last grammar file are as follows:

Line 15: We keep the symbol production, but added an optional symbol at the end. Thus, a new sequences was created, that captures both the choice option we already had, but also an optional rule, that specifies that a Kleene operator might, or might not, follow the terminal or non-terminal symbol. We also added the concept of subsequences. Subsequences are just sequences that are delimited by parenthesis, allowing our BNF syntax to apply Kleene operators to groups of symbols. So, basically, we are parsing this kind of line: ruleBody -> symbolSeq ( '|' symbolSeq )* ;
Line 16: Here we create an anonymous rule (remember that if it starts with an underscore, it will not appear in the final parse tree). We specify that the Kleene operator might be a plus or an asterisk. Those terminals are specified in line 17 and 18.
Line 19: Defines the concept of subsequence: an opening parenthesis, a standard ruleBody content, and a closing parenthesis.

Run the tree-sitter generate command to create a new parser code, and get back to the Rust project. Let’s write the visitors!

First, the dispatcher. Nothing new here, only more lines, to handle more symbols:

bnf2ts/src/main.rs

 fn visit(node: &Node<'_>, source_code: &str) {
      let kind = node.kind();

      match kind {
          "grammar" => visit_grammar(&node, &source_code),
          "rule" => visit_rule(&node, &source_code),
          "nonTerminal" => visit_non_terminal(&node, &source_code),
          "ruleBody" => visit_rule_body(&node, &source_code),
          "symbolSeq" => visit_symbol_seq(&node, &source_code),
          "symbol" => visit_symbol(&node, &source_code),
          "pattern" => visit_pattern(&node, &source_code),
          "literal" => visit_literal(&node, &source_code),
          "subSeq" => visit_symbol_subseq(&node, &source_code),
          _ => println!("Node kind: {}", kind)
      }
  }

Rust

Keeping with simple code. The visitor for the subsequence is just a recursive call of the visit function for the second symbol (the content of the parenthesized expression):

bnf2ts/src/main.rs

  fn visit_symbol_subseq(node: &Node<'_>, source_code: &str) {
      visit(&node.child(1).unwrap(), &source_code);
  }

Rust

The only other change we need to perform is the visitor for the symbol non-terminal. Let’s first look into the code:

bnf2ts/src/main.rs

  fn visit_symbol(node: &Node<'_>, source_code: &str) {
      let child = node.child(0).unwrap();
      let mut kleene = "";
      if node.child_count() > 1 {
         kleene = node.child(1).unwrap().kind();
      }
      match kleene {
          "plus" => print!("repeat1("),
          "asterisk" => print!("repeat("),
          _ => {}
      }
      if child.kind() == "nonTerminal" {
          print!("$.");
      }
      visit(&child, &source_code);
      if kleene != "" {
          print!(")");
      }
  }

Rust

Now, step by step:

Line 82: Get the main child, that is the non-terminal symbol to which the Kleene operator might be applied. Note that this is the only required symbol in the sequence, so we can access to it without further checks.
Line 83: Create a mutable variable that will store information about the Kleene operator being applied.
Lines 85-87: Check if there is a second child. If it exists, there is a Kleene operator. Then, get that node kind, and store in the kleene variable.
Lines 87-91: If we have a Kleene operator in the kleene variable, write the respective tree-sitter call: a repeat1 for the + symbol, a repeat for the * symbol.
Lines 92-94: This is something we already had. If the child is a non-terminal, we need to print the tree-sitter prefix.
Line 95: The usual call to visit our child.
Lines 96-98: If there was a Kleene operator, print the required closing brace.

Now you can test your code: cargo run bnf-grammar.bnf.

/dev/null

Tree-Sitter Tutorial – Part #4

Grammar Literals

Kleene Closures

Leave a Reply