Tree-Sitter Tutorial – Part #2

In the last part of this tutorial we prepared the parser for a simple BNF syntax. In this part we will create a simple C tool that converts this parser syntax into tree-sitter JavaScript syntax.

In the last part of the tutorial, I used cargo to install the tree-sitter-cli tool. While that works perfectly well, it requires that the library, used to compile the parser, has the same version of the client tool. As compiling tree-sitter library from scratch is not that easy, I decided to remove the tool from my ~/.cargo folder, and install things using the Debian package manager:

sudo apt install libtree-sitter-dev tree-sitter-cli

After installing the library, as specified above or compiling from the source code, get into the folder we created in the first part of the tutorial. Then run make. Remember that, when we bootstrapped the project, a lot of files were created. One of those is a Makefile that allows you to quickly build a library file. Some new files will be created: lib-tree-sitter-bnf.so and lib-tree-sitter-bfn.a (if you are under Windows probably you will get just one file, ending in .dll). This is the parser library that we can link with our application.

The source used to compile this library is under the src folder. It was created when we used the tree-sitter generate command. There, you have the parser file, named parser.c, and some include headers, under src/tree_sitter.

Now let us create our application. Start by creating a new folder, named bnf2ts.

Reading the source from disk

To make things easier, we will use a tree-sitter API that receives a large string buffer and performs parsing there. Thus, we will need some code to read our input file (the file we want to parse) from the filesystem. For that, instead of using a library, I decided to ask our friend Chat-GPT to generate the code. After some minor improvements, here is the two files we will add to out project.

bnf2ts/file.h
#ifndef _FILE_H_
#define _FILE_H_

char* read_file_to_string(const char* filename);

#endif /* _FILE_H_ */
C

This is the header file we will include in our main program. It just declares the method that, given a filename, reads the whole file to a memory buffer.

bnf2ts/file.c
#include <stdio.h>
#include <stdlib.h>

char* read_file_to_string(const char* filename) {
    FILE* file = fopen(filename, "r");
    if (!file) {
        perror("Failed to open file");
        return NULL;
    }

    // Seek to the end to determine file size
    if (fseek(file, 0, SEEK_END) != 0) {
        perror("Failed to seek");
        fclose(file);
        return NULL;
    }

    long size = ftell(file);
    if (size < 0) {
        perror("Failed to tell file size");
        fclose(file);
        return NULL;
    }
    rewind(file); // Go back to the beginning

    // Allocate buffer (+1 for null terminator)
    char* buffer = (char*)malloc(size + 1);
    if (!buffer) {
        perror("Failed to allocate buffer");
        fclose(file);
        return NULL;
    }

    // Read the file
    size_t read_size = fread(buffer, 1, size, file);
    if (read_size != size) {
        perror("Failed to read entire file");
        free(buffer);
        fclose(file);
        return NULL;
    }

    buffer[size] = '\0'; // Null-terminate the string

    fclose(file);
    return buffer;
}
C

I suppose the code is well documented, and easy to understand. In fact, if you are having trouble reading it, you might have problems keeping up with the following part of the tutorial.

The basic mechanism is simple: open the file, seek the end of the file, get the file size, allocate the space required, seek the beginning of the file, and read the file contents. It just has a lot of checks to guarantee none of these processes failed miserably.

Parsing

The basic parser will be created in file bnf2ts.c. I will start by dumping here the whole program (well, our first version), and will then comment the relevant lines to make it easier to follow up.

bnf2ts/bnf2ts.c
#include <string.h>
#include <stdio.h>
#include <tree_sitter/api.h>

// Declares the function `TSLanguage *tree_sitter_bfn(void)`
#include <tree-sitter-bnf.h> // this was created by tree-sitter-cli

#include "file.h" // our file handling code

int main(int argc, char *argv[]) {
	if (argc != 2) {
          fprintf(stderr, "Usage: bnf2ts file.bnf\n");
          return 1;
	}

    char* filename = argv[1];
    char *source_code = read_file_to_string(filename);

    if (!source_code) return 1;

    // Create a parser.
    TSParser *parser = ts_parser_new();
  
    // Set the parser's language
    bool check = ts_parser_set_language(parser, tree_sitter_bnf());
    if (!check) {
       fprintf(stderr, "Library version mismatch\n");
       return 1;
    }
   
    // Build a syntax tree based on source code stored in a string.
    TSTree *tree = ts_parser_parse_string(parser, NULL, source_code, strlen(source_code));

    if (!tree) {
       fprintf(stderr, "Error parsing file\n");
       return 1;
    }
  
    // Free all of the heap-allocated memory.
    free(source_code);
    ts_tree_delete(tree);
    ts_parser_delete(parser);
    return 0;
}
C
  • Line 6: this includes a header file generated by the tree-sitter-cli. It just defines the prototype for TSLanguage *tree_sitter_bfn(void), so it can be simply replaced by this prototype. Nevertheless, I feel that including the header is more elegant.
  • Line 8: include our file library.
  • Lines 11 to 19: deal with the program parameters (we will receive the filename we want to parse as first argument to the bnf2ts tool), and read the file to memory.
  • Line 22: create a new tree-sitter generic parser.
  • Line 25: set the BNF language information into the tree-sitter parser (that is, configure the generic parser for our language). This call might return false if the grammar and the libraries have mismatched versions. Thus, added there an error message to help you debugging your environment.
  • Line 32: invoke the parser. The arguments are the parser, a pointer to an old parse tree (if we have it), the string to be parsed, and its size. Note that tree-sitter is able to deal with non-textual languages, where the \0 might be used as a symbol. Therefore we need to supply the buffer to parse as well as its size. As our language is textual, we can just compute the string size. This call returns the parse tree, or NULL in case of parsing errors.
  • Line 34: if the parser failed, print a message. If it does not fail, we will print nothing.
  • Lines 41 to 42: free all the allocated memory.

This code just parses the BNF program. If the syntax if valid, it prints nothing. If the syntax is invalid, it prints an error message.

To compile the program I used:

gcc bnf2ts.c file.c ../libtree-sitter-bnf.a -I ../bindings/c/tree_sitter/ -o bnf2ts -ltree-sitter
Bash

I am not compiling dynamically, but still using the built library. You can replace the library with the source code you have in the src folder. Its up to you.

gcc  bnf2ts.c file.c ../src/parser.c -I ../bindings/c/tree_sitter/ -o bnf2ts -ltree-sitter
Bash

We added the -I flag, to specify where to find the included files. And there is -ltree-sitter to link with the generic tree-sitter library.

This might be made cleaner, or created a Makefile. For now, we just want to test something 😁

./bnf2ts ../sample.bnf
Bash

Visiting the parsing tree

Well, we are writing in C, so things get a little more complicated. The following code is not the more elegant or efficient. I decided to go for a simple approach, easy to understand.

First, add a call to a Visit function, after the call to the parser:

bnf2ts/bnf2ts.c
  // ...
  // Get the root node of the syntax tree.
  TSNode root_node = ts_tree_root_node(tree);
  Visit(root_node, source_code);
  // ...
C

The Visit function will receive a node of the tree (in the code above, the root note) and the code we are parsing. Why this is useful will be clarified later. The structure of the Visit function is as follows:

bnf2ts/bnf2ts.c
void Visit(TSNode node, const char* source_code) {
  const char* name = ts_node_type(node);
  uint32_t child_count = ts_node_child_count(node);
  // ...
}
C

We get information about the current node name (it will return things like grammar, rule or ruleBody). We will use that name to decide how to visit each node, and act accordingly. We will not create an abstract tree, or any structure. We will output directly the conversion to the tree-sitter syntax.

We also get information about the number of children. For rules like grammar, the number varies accordingly with the number of rules it has. For other rules, like rule, the number of children is always the same (in this case, 4).

Now, the complete function follows.

bnf2ts/bnf2ts.c
void Visit(TSNode node, const char* source_code) {
  const char* name = ts_node_type(node);
  uint32_t child_count = ts_node_child_count(node);

  if (strcmp("rule", name) == 0) {
    Visit(ts_node_child(node, 0), source_code);
    printf(": $ => ");
    Visit(ts_node_child(node, 2), source_code);
    printf(",\n");
  }
  else if (strcmp("terminal", name) == 0 || strcmp("nonTerminal", name) == 0) {
    char* text = get_text(node, source_code);
    printf("%s", text);
    free(text);
  }
  else if (strcmp("symbol", name) == 0) {
    printf("$.");
    Visit(ts_node_child(node, 0), source_code);
  }
  else if (strcmp("symbolSeq", name) == 0) {
    if (child_count == 1)
      Visit(ts_node_child(node, 0), source_code);
    else {
      printf("seq(");
      for (int i = 0; i < child_count; i++)
      {
        Visit(ts_node_child(node, i), source_code);
        if (i < child_count - 1) printf(", ");
      }
      printf(")");
    }
  }
  else if (strcmp("ruleBody", name) == 0) {
    if (child_count == 1)
      Visit(ts_node_child(node, 0), source_code);
    else {
      printf("choice(");
      for (int i = 0; i < child_count; i+=2)
      {
        Visit(ts_node_child(node, i), source_code);
        if (i < child_count - 1) printf(", ");
      }
      printf(")");
    }
  }
  else if (strcmp("grammar", name) == 0) {
    printf("  rules: {\n");
    for (int i = 0; i < child_count; i++) {
      Visit(ts_node_child(node, i), source_code);
    }
    printf("   }\n");
  }
  else {
    TSSymbol node_type = ts_node_symbol(node);
    fprintf(stderr, "Unknown symbol %d: %s\n", node_type, name);
  }
}
C

The order of the rules is, more or less, related with the amount of times each node appears, so the order is not top-down regarding the grammar structure. But I will visit each visitor, by order.

  else if (strcmp("grammar", name) == 0) {
    printf("  rules: {\n");
    for (int i = 0; i < child_count; i++) {
      Visit(ts_node_child(node, i), source_code);
    }
    printf("   }\n");
  }
C

If the node is for the grammar rule, we know its children are of type rule. We add some boilerplate output to structure the tree-sitter JavaScript syntax (we are not creating the whole file), and visit each child. For that, we iterate on the number of child, get the child node with ts_node_child, and supply it to the Visit function.

  if (strcmp("rule", name) == 0) {
    Visit(ts_node_child(node, 0), source_code);
    printf(": $ => ");
    Visit(ts_node_child(node, 2), source_code);
    printf(",\n");
  }
C

The rule rule has four children. The first is the non-terminal associated with the rule, the arrow, the ruleBody non-terminal, and the semicolon. Thus, we are visiting the non-terminal (first child), then we output the required JavaScript to separate between the non-terminal and the rule contents, then we visit the rule body, and finally we add the JavaScript comma at the end of the line.

  else if (strcmp("ruleBody", name) == 0) {
    if (child_count == 1)
      Visit(ts_node_child(node, 0), source_code);
    else {
      printf("choice(");
      for (int i = 0; i < child_count; i+=2)
      {
        Visit(ts_node_child(node, i), source_code);
        if (i < child_count - 1) printf(", ");
      }
      printf(")");
    }
  }
C

The ruleBody code is a little more complicated. But not that complicated. The number of children will always be odd: 1, 3, 5, etc. In case there is just one children, we just visit it. If not, then we are looking to a choice (a list separated by |). For that, print the choice invocation, and then visit each of the children (but just the odd ones, are we are visiting only the non-terminals). Separate each child invocation with a comma. At the end, close the choice invocation.

  else if (strcmp("symbolSeq", name) == 0) {
    if (child_count == 1)
      Visit(ts_node_child(node, 0), source_code);
    else {
      printf("seq(");
      for (int i = 0; i < child_count; i++)
      {
        Visit(ts_node_child(node, i), source_code);
        if (i < child_count - 1) printf(", ");
      }
      printf(")");
    }
  }
C

The symbol sequence (symbolSeq) is similar to the ruleBody. The only difference is that we need to visit all symbols, as there is no separator between them.

  else if (strcmp("terminal", name) == 0 || strcmp("nonTerminal", name) == 0) {
    char* text = get_text(node, source_code);
    printf("%s", text);
    free(text);
  }
C

For the symbol non-terminal, we need to visit its child, but prefix with the $. characters:

  else if (strcmp("symbol", name) == 0) {
    printf("$.");
    Visit(ts_node_child(node, 0), source_code);
  }
C

Finally, for the terminal and non-terminal, we are just printing its name. We are using the get_text method that, given the node and the source code, returns the text, from the parsed text, that matched this symbol.

Before showing the code for the get_text function, just a note about the last block of the Visit function. It is to print symbols that the visitor is not handling. In the current situation, it should never happen, but if it does, it allows us to debug the code.

Finally the get_text function uses the source_code variable (that’s why we were using it all this time), and given the byte offset of the tree-sitter node, returns the matched text.

bnf2ts/bnf2ts.c
char* get_text(TSNode node, const char* source_code) {
  uint32_t start = ts_node_start_byte(node);
  uint32_t end = ts_node_end_byte(node);
  size_t len = end - start;

  return strndup(source_code + start, len);
}
C

So, we get the start and end byte, and compute the string length. Then use strndup to allocate the memory for that string chunk, and copy it there. Note that, given this is allocated, our visitor needs to free this memory.

With all this code in place we can compile the program and run it. You will get something very similar to the Tree-Sitter syntax:

JavaScript
  rules: {
Grammar: $ => choice($.Rule, seq($.Grammar, $.Rule)),
Rule: $ => seq($.nonTerminal, $.arrow, $.RuleBody, $.semicolon),
RuleBody: $ => choice($.SymbolSeq, seq($.RuleBody, $.pipe, $.SymbolSeq)),
SymbolSeq: $ => choice($.Symbol, seq($.SymbolSeq, $.Symbol)),
Symbol: $ => choice($.terminal, $.nonTerminal),
   }
JavaScript

More work is needed to specify the terminals, either as literals or regular expressions. But that will be part of a next tutorial part.

NOTE: update at 14th April, moving the print of $. from the terminal/non-terminal visitors into the symbol visitor, as the output was not correct accordingly to Tree-Sitter syntax.

Leave a Reply