Programming Language Foundations

Building a Calculator

Learning Objectives

You know of concrete syntax trees (ASTs) and how to define them in Rust.
You know what an abstract syntax tree (AST) is and how to define one in Rust.
You know how an AST can be used to represent arithmetic computations.
You know how to implement and extend a recursive calculation function.

A Language for Arithmetic

Let’s now design a language for a calculator which uses a prefix-style Polish notation. The tokens in the language are number literals and operators have fixed arity of two. Number literals are any sequence of number characters separated by whitespace. Operators are +, -, *, etc.

Here are some examples of expressions in this language:

0
+ 1 2
- 3 5
+ + 1 2 3, which is interpreted as + (+ 1 2) 3

For now, negative number literals are not supported.

The grammar of the language is defined using the following rules:

A Digit is a character between 0-9.
A Literal is a string of one or more Digits.
An Operator is one of +, - or *.
A Space is a string of one or more whitespace characters.
An Expression is an Op followed by two Exprs separated by Spaces, or it’s a Lit. Notice that operators or spaces are not expressions.

The grammar in Backus-Naur form is as follows:

Here, the characters in double quotes denote terminals of the language, i.e. non-reducible lexical elements.

Loading Exercise...

Next we define the grammar in Rust with enums for each kind of symbol.

No files opened Select a file to edit

Expr is called the concrete syntax tree (abbreviated CST, sometimes called a parse tree) of the language. It represents grammatically valid expressions in the language with enough detail to convert back to the original source text.

An abstract syntax tree (AST), on the other hand, gets rid of irrelevant syntactic information, such as whitespace and comments. We will look at the AST form of the above language next.

Abstract Syntax Trees

An abstract syntax tree is a tree-like (i.e. recursive) data structure just like the CST. However, instead of splitting the grammar over multiple syntax kinds, we have just one data type.

One of the goals of defining an AST is to retain only relevant information about the input. For example whitespace gets removed and different operators become different node variants in the tree.

The AST for the arithmetic language can be defined in Rust using a single enum:

No files opened Select a file to edit

What’s notable is that the AST does not have anything to do with the CST directly. For example, literals in the AST are just integers (represented by the Num variant), whereas literals in the CST were linked lists of digits.

As AST is an enum with multiple variants, it has multiple constructors. As mathematics doesn’t need to deal with memory management, we write the constructors mathematically as follows.

We also use the mathematical (unbounded) integers to represent the 32-bit integers (i32) as we don’t have any practical limitations in mathematics that arise in the physical world.

The conversion from a CST to an AST is typically a good place to enforce certain kinds of rules that the language should satisfy. For example if an integer literal is out of range, the conversion should return an error. Another example could be scope lexical analysis, i.e. checking for undefined and unused variables. Unused variables don’t need to be rejected, but can just produce a warning for example.

Loading Exercise...

Converting CST to AST

The process of converting CST to AST is a simple recursive algorithm where the complexity mostly lies in the literal case.

No files opened Select a file to edit

Memory Management Options

As you may recall from the Rust chapters or the Rust lecture, in Rust it is necessary to introduce indirection to a recursive enum to make its size known. So far we’ve always used Box, which is a smart pointer to a heap-allocated memory region. Box works very well for our purposes but it has some shortcomings, one of the most significant being that we end up cloning the data within much more than is actually necessary.

As an alternative we could use an Rc, which is a reference-counting pointer. A reference counting pointer implements shared ownership of a value on the heap, dropping it when the last reference is dropped. Cloning an Rc is significantly cheaper than cloning a Box because we only need to increment the counter instead of allocating a new Box and cloning the data there. An Rc cannot provide mutable access to the value within like Box does, but mutable access is usually not needed when working with ASTs.

Read more about Rc in the Rust standard library documentation

Parentheses

We will augment the grammar with parentheses to make it easier to read and write complex expressions.

Loading Exercise...

The Calculator

To make the calculator usable we need to write a parser for it.

A parser is a function that converts source text to a CST according to the grammar of the language. Writing a parser from scratch (without using external libraries) can be a fun challenge, but is not our approach — instead we will use the nom library. nom provides all the technical parts so that we can just focus on what’s important.

Parsing to CST versus AST

A parser can also directly produce the AST without going through CST first. This is usually simpler and suitable for smaller applications. However when writing a programming language, we would like to retain source code related information such as line numbers and columns — which the AST loses.

We therefore prefer to make the parser in two stages: parser : String -> CST and convert : CST -> AST.

The following example shows how one would write a nom parser for the arithmetic language.

Click the cog icon in the editor and try providing + + 1 2 3 as input.

No files opened Select a file to edit

Note that right now you are not expected to understand how it works, and you don’t need to write any parser code yourself. Instead we will look at a high-level overview of the parser.

Nom Nom Chomp Chomp — How does it work?

Let’s analyze the previous code example to get a grasp of how the parser works.

First, we used a type alias for the parse results. Our parse results use a VerboseError for improved parse errors. To see this, provide e.g. ((+ 1 2) as input to it.

type IResult<I, O> = nom::IResult<I, O, VerboseError<I>>;

The I and O type variables stand for input and output. Input is the type of the source text that the parser consumes. Output is the type of data that it produces.

Now, let’s look at the actual parsers’s signatures.

pub fn parse_digit(input: &str) -> IResult<&str, Digit>

pub fn parse_lit(input: &str) -> IResult<&str, Lit>

pub fn parse_op(input: &str) -> IResult<&str, Op>

pub fn parse_space(input: &str) -> IResult<&str, Space>

pub fn parse_expr(input: &str) -> IResult<&str, Expr>

As you probably noticed, there’s a naming convention: we write parse_thing for a parser that returns a Thing as the output.

The top-level parser one would use to now parse an arithmetic expression is all_consuming(parse_expr) where all_consuming wraps a parser such that it fails if the parser stops too early. You can see this in action in the previous example’s main function.

Read the nom documentation to learn more about nom and parser combinators.

Calculator Semantics

Let’s now focus on the semantic side of the calculator — or more precisely, how to turn the AST into a numerical value.

The semantics are rather simple and easy to express mathematically. We define the calculate function that takes in an AST and returns an i32 (which we generalize to mathematical integers ).

As AST has four constructors, we write an equation for every constructor. For example to calculate a Num(n) we simply return n. The other cases use recursion as expected.

Your task is now to first translate this definition to Rust and then add some extra features to it.

Loading Exercise...

← Overview

Theory of Programming Languages →