ANTLR Tutorial – Part II

Continuing with the ANTLR tutorial, started in my last post, this second part advances with three main goals:

Simplification of the grammar, taking advantage of the ANTLR skip option;
File organization, with the creation of a namespace and putting files in different folders: this will require some minor changes in the BuildAntlr project;
The first version of our Abstract Syntax Tree creator. Sorry, we will not generate SVG code just yet.

Grammar Simplification

The grammar presented in the last post was somewhat cumbersome. The way we dealt with whitespace characters made the grammar difficult to read, as we had to state explicitly every place where spaces could appear in the Logo program.

To make the grammar simpler we can take advantage of the ANTLR skip option. A lexer rule marked with -> skip still matches its input, but the matched text is discarded and no token is passed to the parser. The parser rules therefore never have to mention it.

Logo.g4

grammar Logo;

// Parser

program : command+ EOF
        ;

command : Right Value
        | Forward Value
        ;

// Lexer

Right   : 'RIGHT' | 'RT' ;
Forward : 'FORWARD' | 'FD' ;

Value   : [0-9]+ ;           
            
White   : [ \n\t\r] -> skip;


This version of the grammar is much easier to understand. A command is the Right or the Forward command, followed by the numeric argument. A program is just a sequence of commands. Easy. Note that in the lexer we specify that whitespace characters are to be skipped, using the arrow syntax.
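To make the effect of skip more concrete, here is a minimal sketch of a hand-written tokenizer that behaves in a similar way. It is only an illustration: the SkipSketch class and its Tokenize method are names I made up, and this is not how the lexer generated by ANTLR is implemented. Unlike a real lexer, it also silently ignores characters that match no rule.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical hand-written tokenizer, only to illustrate what "-> skip" means.
// ANTLR generates its own lexer; this is not how LogoLexer works internally.
public static class SkipSketch
{
    // One alternative per lexer rule of the grammar; "ws" plays the role of White.
    private static readonly Regex TokenPattern = new Regex(
        @"(?<ws>[ \n\t\r])|(?<word>FORWARD|FD|RIGHT|RT)|(?<value>[0-9]+)");

    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(source))
        {
            // Whitespace is matched and then dropped, so a parser would never see it.
            if (match.Groups["ws"].Success) continue;
            tokens.Add(match.Value);
        }
        return tokens;
    }
}
```

Calling SkipSketch.Tokenize("FD 50\nRT  90") gives the four tokens FD, 50, RT and 90, no matter how many spaces or line breaks separate them.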

Project Structure

To make our code easier to understand, we will create some folders inside the Logo2Svg project and rename some files:

1. Move the Grammar

The grammar will be moved to its own folder, named Language. We will instruct ANTLR to place its generated files in that folder too, making it easier to tell generated code from our own code.

mkdir Logo2Svg/Language
mv Logo2Svg/Logo.g4 Logo2Svg/Language


2. Create Folders

We will create the AST folder, where new files will be created for the construction of our Abstract Syntax Tree. Details on what an AST is will be discussed later. We will also create a folder named Turtle where the code that handles the Logo turtle and code generation will be placed.

mkdir Logo2Svg/AST
mkdir Logo2Svg/Turtle


3. File Renaming

Our logo2svg.cs file will be renamed to App.cs, as we will use Logo2Svg as the namespace for our project. While it is possible to reuse a class name as a namespace name, it gets confusing.

mv Logo2Svg/logo2svg.cs Logo2Svg/App.cs


Do not forget to edit the class name in the file, and take the chance to add the namespace line. Using a namespace is good practice in C#.

App.cs

using Antlr4.Runtime;

namespace Logo2Svg;

static class App {
  static int Main(string[] args)

// [...]

4. Add a Namespace to ANTLR Generated Code

We need to make sure that the ANTLR code is generated in the folder we just created, and that a namespace is added to all generated files. To achieve this we add two new properties to the BuildAntlr task: one with the path where the files are to be generated, and the other with the name of the namespace to use.

BuildAntlr.cs

// [...]
   
    public class BuildAntlr : Task
    {
        public string Namespace { get; set; }
        public string OutputDir { get; set; }

// [...]

Now, these two properties need to be used in the ANTLR invocation. Replace the proc object creation with these lines:

BuildAntlr.cs

// [...]

        var output = string.IsNullOrEmpty(OutputDir) ? "" : $"-o {OutputDir}";
        var @namespace = string.IsNullOrEmpty(Namespace) ? "" : $"-package {Namespace}";

        var proc = new Process
        {
            StartInfo = new ProcessStartInfo
            {
                FileName = JavaPath,
                Arguments = $"-Xmx500M -cp {cp} org.antlr.v4.Tool {visitor} {listener} {output} {@namespace} -Dlanguage=CSharp {files}",
                UseShellExecute = false,
                RedirectStandardOutput = false,
                CreateNoWindow = true
            }
        };

// [...]

As we did not mark these two properties as required, we need to handle the case where they are not set. If one of them is empty, we use an empty string instead, and the tool behaves as it did until now. If they are set, the -o option selects the output directory and the -package option sets the namespace of the generated code.

The @ prefix in @namespace is just a way to tell C# that we are using a keyword (namespace) as a variable name.
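To see what the two optional switches expand to, the same logic can be isolated in a small helper. This is just a sketch: AntlrArguments and Build are hypothetical names that are not part of the BuildAntlr project, and only the fragment of the command line that concerns us is built.

```csharp
public static class AntlrArguments
{
    // Hypothetical helper mirroring the two optional switches built in BuildAntlr.cs.
    public static string Build(string outputDir, string @namespace)
    {
        var output = string.IsNullOrEmpty(outputDir) ? "" : $"-o {outputDir}";
        // The @ prefix lets the keyword "namespace" be used as an identifier.
        var package = string.IsNullOrEmpty(@namespace) ? "" : $"-package {@namespace}";
        // Trim removes the leading spaces left behind when both switches are empty.
        return $"{output} {package} -Dlanguage=CSharp".Trim();
    }
}
```

With Build("Language", "Logo2Svg.Language") the result is "-o Language -package Logo2Svg.Language -Dlanguage=CSharp", while Build(null, null) gives just "-Dlanguage=CSharp", which matches what the task did before this change.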

5. Update the Logo2Svg Project Configuration

Finally, the Logo2Svg.csproj needs to be updated. As there are a lot of differences, the full file is shared below.

Logo2Svg.csproj

<Project Sdk="Microsoft.NET.Sdk">

  <UsingTask TaskName="BuildAntlr.BuildAntlr" AssemblyFile="../BuildAntlr/bin/Debug/netstandard2.0/BuildAntlr.dll"/>

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>disable</Nullable>

    <RootFolder>$(MSBuildProjectDirectory)</RootFolder>
    <GrammarPath>$(RootFolder)/Language</GrammarPath>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Antlr4.Runtime.Standard" Version="4.13.1" />
  </ItemGroup>

  <Target Name="BuildAntlr" BeforeTargets="CoreCompile" Inputs="$(GrammarPath)/Logo.g4" Outputs="$(GrammarPath)/LogoParser.cs $(GrammarPath)/LogoLexer.cs $(GrammarPath)/LogoBaseVisitor.cs $(GrammarPath)/LogoVisitor.cs">
     <BuildAntlr
        JavaPath="/usr/bin/java"
        AntlrJar="/opt/local/lib/antlr-4.13.1-complete.jar"
        Listeners="false"
        Namespace="Logo2Svg.Language"
        Visitors="true"
        OutputDir="$(GrammarPath)"
        Files="$(GrammarPath)/Logo.g4">
     </BuildAntlr>
     <ItemGroup>
        <Compile Include="$(GrammarPath)/LogoParser.cs" />
        <Compile Include="$(GrammarPath)/LogoLexer.cs" />
        <Compile Include="$(GrammarPath)/LogoVisitor.cs" />
        <Compile Include="$(GrammarPath)/LogoBaseVisitor.cs" />
     </ItemGroup>
  </Target>

  <Target Name="CleanGeneratedFiles" BeforeTargets="Build">
        <Delete Files="$(GrammarPath)/LogoParser.cs" />
        <Delete Files="$(GrammarPath)/LogoLexer.cs" />
        <Delete Files="$(GrammarPath)/LogoVisitor.cs" />
        <Delete Files="$(GrammarPath)/LogoBaseVisitor.cs" />
  </Target>

</Project>


Main differences (line numbers refer to the listing above):

Line 12 defines a new variable, GrammarPath, holding the folder where the grammar lives and where the generated files will be placed.
Line 19 was updated to use this new variable.
Line 24 defines the namespace we want, and line 26 specifies the output folder. Note that the order of the attributes is irrelevant.
The blocks on lines 30-33 and 38-41 were updated to use the new path.

Take the chance to update the .gitignore in case you are using Git.

.gitignore

*/obj
*/bin
Logo2Svg/Language/LogoBaseVisitor.cs
Logo2Svg/Language/Logo.interp
Logo2Svg/Language/LogoLexer.cs
Logo2Svg/Language/LogoLexer.interp
Logo2Svg/Language/LogoLexer.tokens
Logo2Svg/Language/LogoParser.cs
Logo2Svg/Language/Logo.tokens
Logo2Svg/Language/LogoVisitor.cs


6. Update Namespace References

As you can see in the Logo2Svg.csproj file, line 24 defines a nested namespace: Logo2Svg.Language. We need to update the App.cs code to work with this whenever it refers to ANTLR generated code. Just add the correct using directive at the top of that file.

App.cs

using Logo2Svg.Language;

Before compiling your code, make sure you delete all the previously generated files from the root of the Logo2Svg folder.

Abstract Syntax Tree

An abstract syntax tree (AST) is a structure that represents our program abstractly. It does not specify the behaviour of the program; it just depicts how the commands are interrelated. An AST should hold all the information required to understand the original code and execute it. The following image was copied from the Wikipedia article on AST and shows the Euclidean algorithm.

While not strictly necessary when creating an AST with ANTLR, whenever you create a tree in computer science it is good practice that all nodes share a common abstract class or interface. When to choose an interface or an abstract class is a complex discussion. I will use an interface as I do not intend to have shared code for the different node types. The interface will be named INode. For this part of the tutorial, the interface will be empty. It will be created in the Logo2Svg/AST folder.

AST/INode.cs

namespace Logo2Svg.AST
{
    public interface INode
    {
    }
}

There are two types of nodes we will create for now: Command, which represents any atomic command, like the Forward or Right commands, and Program, which represents a sequence of commands (the whole program).

Create AST/Command.cs with the following code:

AST/Command.cs

using Logo2Svg.Language;

namespace Logo2Svg.AST
{
    public class Command : INode
    {
        public string Name { get; }
        public int Value { get; }

        public Command(string command, string value)
        {
            Name = command;
            if (int.TryParse(value, out var intVal))
            {
                Value = intVal;
            }
        }

        public override string ToString() => $"{Name}({Value})";
    }
}

The command will have two fields, the command name, a string (for now) and its parameter (an integer value, for now as well). The constructor receives both as string values, stores the command name, and converts the value from string to integer. A stringification method was also added for debugging purposes.
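One consequence of this conversion is worth seeing in isolation: when int.TryParse fails, Value silently stays at 0. The sketch below reproduces that rule in a standalone helper and adds a stricter variant for comparison. ValueSketch, ToValue and ToValueStrict are hypothetical names, and the strict variant is only one possible alternative, not something the tutorial code does.

```csharp
public static class ValueSketch
{
    // Same rule as the Command constructor: keep the parsed number,
    // or fall back to 0 (the default of int) when parsing fails.
    public static int ToValue(string text) =>
        int.TryParse(text, out var parsed) ? parsed : 0;

    // Hypothetical stricter variant: fail loudly instead of defaulting to 0.
    public static int ToValueStrict(string text) =>
        int.TryParse(text, out var parsed)
            ? parsed
            : throw new System.FormatException($"'{text}' is not a valid 32-bit integer");
}
```

ToValue("100") is 100, as expected. ToValue("99999999999") is 0, because the number does not fit in an int, even though the Value rule of the grammar ([0-9]+) accepts it. So a command such as FD 99999999999 would end up as FD(0) in our tree.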

Now, create the AST/Program.cs file with the following code:

AST/Program.cs

namespace Logo2Svg.AST
{
    public class Program : List<Command>, INode
    {
        public override string ToString() => string.Join("\n", this);
    }
}

A program is a list of commands (for now), and as such we can make it inherit from the List<Command> class. It is also a node, and thus should implement INode. The ToString method just joins the string forms of all commands, one per line.
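Before wiring these classes to ANTLR, they can be tried on their own. The sketch below condenses the Command and Program classes shown earlier into a single file and builds a tree by hand. The AstSketch namespace and the ProgramSketch class are made up for this illustration, so that the snippet does not clash with the real project.

```csharp
using System.Collections.Generic;

namespace AstSketch
{
    public interface INode { }

    // Condensed copy of the Command class shown earlier.
    public class Command : INode
    {
        public string Name { get; }
        public int Value { get; }

        public Command(string command, string value)
        {
            Name = command;
            if (int.TryParse(value, out var intVal)) Value = intVal;
        }

        public override string ToString() => $"{Name}({Value})";
    }

    public class Program : List<Command>, INode
    {
        // string.Join calls ToString on every Command in the list.
        public override string ToString() => string.Join("\n", this);
    }

    public static class ProgramSketch
    {
        // The collection initializer works because Program inherits Add from List<Command>.
        public static Program Build() =>
            new Program { new Command("FD", "50"), new Command("RT", "90") };
    }
}
```

Here AstSketch.ProgramSketch.Build().ToString() yields two lines, FD(50) and RT(90). Inheriting from List<Command> gives us Add, AddRange and enumeration for free. Holding the list in a private field would be the alternative if we later want to hide those methods.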

Finally, we will create an override of the default ANTLR visitors. They will be placed in the AST/TreeVisitor.cs file.

AST/TreeVisitor.cs

using Antlr4.Runtime.Misc;
using Antlr4.Runtime.Tree;
using Logo2Svg.Language;

namespace Logo2Svg.AST
{
    public class TreeVisitor : LogoBaseVisitor<INode>
    {
        public override INode VisitCommand([NotNull] LogoParser.CommandContext context)
        {
            string value = context.Value().GetText();
            string command = null;

            if (context.Forward() is { } forwardContext)
            {
                command = forwardContext.GetText();
            }
            if (context.Right() is { } rightContext)
            {
                command = rightContext.GetText();
            }

            return command != null ? new Command(command, value) : null;
        }

        public override INode VisitProgram([NotNull] LogoParser.ProgramContext context)
        {
            Program program = new();
            program.AddRange(context.command().Select(cmd => Visit(cmd) as Command).ToList());
            return program;
        }
    }
}

This code is a bit more complex, but not hard to understand. The file overrides two methods, one visitor for each of the two rules we have in the grammar. Their default versions are defined in the LogoBaseVisitor class, generated by ANTLR. Note that this generated class is generic, meaning that the visitors can return whatever type is desired. That is why our class derives from LogoBaseVisitor<INode>.

Visitors receive a context object. These objects represent each ANTLR rule. For example, the CommandContext allows us to access the Forward, Right and Value symbols used in the two productions.

As both productions have the Value symbol, we start by getting the string that matches it. For that, we use the GetText accessor. For the commands, we need to check whether the symbols are null or not. The syntax we use there is a modern C# pattern (is { } name) that does something similar to:

            var forwardContext = context.Forward();
            if (forwardContext != null)
            {
                command = forwardContext.GetText();
            }
            var rightContext = context.Right();
            if (rightContext != null)
            {
                command = rightContext.GetText();
            }

The interesting part about the syntax used above is that it gets a lot more compact.
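If this pattern is new to you, it can be tried outside ANTLR with a plain string. This is a minimal sketch, and PatternSketch and Describe are hypothetical names used only here.

```csharp
public static class PatternSketch
{
    // "is { } name" succeeds only for non-null values and binds the value to name.
    public static string Describe(string text)
    {
        if (text is { } nonNull)
        {
            return $"got {nonNull}";
        }
        return "got nothing";
    }
}
```

Describe("FD") returns "got FD" and Describe(null) returns "got nothing". The null check and the variable declaration happen in a single expression.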

Whichever symbol is not null, we just get the command string from it. At the end, we create and return a Command object if the command name is not null, or null otherwise.

As for the other visitor, it is simpler to understand. Because the command symbol in the program rule can repeat itself (it has the + modifier), the ProgramContext.command method returns a collection of contexts, one for each command. We call the ANTLR Visit method on each of these contexts, and it invokes the matching visitor, here VisitCommand. As we defined that visitor ourselves, we know that it returns a Command, and therefore we can cast the result. Thus, the LINQ line of code takes each command context, visits it, converts the result into a Command, and stores it in the Program object, which is returned at the end.
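A detail of the cast in that LINQ line: the as operator never throws. It produces null when the visited node is not a Command (or is null itself), and that null would then be stored in the Program. With the current grammar I do not expect this to happen for valid input, but the sketch below shows the difference to OfType, which drops such entries instead. The CastSketch namespace and the Other and Demo types are stand-ins made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;

namespace CastSketch
{
    public interface INode { }

    // Minimal stand-ins for AST nodes, only for this illustration.
    public class Command : INode
    {
        public string Name { get; set; }
    }

    public class Other : INode { }

    public static class Demo
    {
        // "as" never throws: a node that is not a Command becomes a null entry.
        public static List<Command> WithAs(IEnumerable<INode> nodes) =>
            nodes.Select(node => node as Command).ToList();

        // OfType keeps only the elements that really are Commands.
        public static List<Command> WithOfType(IEnumerable<INode> nodes) =>
            nodes.OfType<Command>().ToList();
    }
}
```

Given a Command followed by an Other, WithAs returns two entries, the second one null, while WithOfType returns a single entry. Which behaviour is preferable depends on whether a missing command should be visible later or silently ignored.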

Finally, we can update the App.cs program to visit the parse tree created by ANTLR and obtain the AST. Edit this file and replace the last lines of the try block:

App.cs

      var programContext = parser.program();
      var visitor = new TreeVisitor();
      Program program = visitor.Visit(programContext) as Program;
      Console.WriteLine(program);

In this extract, the first line returns the program context, which represents the root of the parse tree created by ANTLR. The second line instantiates our tree visitor, which is invoked in the third line. As we pass it a program context, we know the result is a Program, and therefore we can cast it. Finally, the last line prints the string representation of the program.

If you run this with the good example we used in Part I of this tutorial, you will get the following output:

From good.logo to foo.svg
FORWARD(100)
RT(90)
FD(50)


Too much work? We could have some SVG already, but the idea is to create a good code base that will allow us to develop faster.

As before, the code was added to the GitHub repository. Please note that some decisions made in the previous part, as well as in this one, are not final. To keep posts small and to guarantee the code works at the end of each section, I decided to go step by step, focusing on the specific aspects I feel are more important, and polishing the code as we go.

Please feel free to leave comments, suggestions or complaints. Feedback is more than welcome, and it will help me make sure this is useful to someone.

2 thoughts on “ANTLR Tutorial – Part II”

LuisPereira says:

February 17, 2024 at 12:44

Awesome stuff again!
One thing I may have missing from the first two parts is maybe a link to a GitHub repo with the code already in the desired finalised state (although it’s straightforward to replicate the code with the step by step described here :) )

ambs says:

February 18, 2024 at 20:59

You have a repository with tags for each tutorial part.
