Tokens and Lexing
Learning Objectives
- You understand what tokens are and why we need them.
- You can define token types using enums with data.
- You know how to use peekable iterators for lookahead parsing.
- You can build a lexer that transforms text into tokens.
- You understand how to handle different character types (digits, letters, operators).
From Text to Tokens
When you read a sentence, you don’t process individual letters - you recognize words. Similarly, when processing code, we don’t work with individual characters - we recognize tokens.
Learning involves chunking information into meaningful units. Instead of seeing isolated pieces such as letters or characters, experts see larger patterns like words or tokens. Perhaps the most famous example of such expertise comes from chess, where expert players see configurations of pieces rather than individual pieces.
A token is the smallest meaningful unit in a programming language. Just as “the,” “cat,” and “sat” are tokens in English, PRINT, 42, and + are tokens in BASIC.
Consider this BASIC line:
10 PRINT "Hello!"
A human sees the following pieces (the spaces are ignored):
- 10 - a line number
- PRINT - a keyword
- "Hello!" - a string literal
A lexer would transform the text into these tokens:
[
Token::Number(10),
Token::Print,
Token::String("Hello!".to_string()),
]
Tokens capture both what kind of thing they are (number, keyword, string) and what value they contain (if applicable).
Why Not Work with Characters Directly?
Three reasons:
1. Abstraction
Working with tokens is easier than working with characters. Compare:
// With tokens:
if token == Token::Print { /* handle print */ }
// With characters:
if chars[0] == 'P' && chars[1] == 'R' && chars[2] == 'I'
&& chars[3] == 'N' && chars[4] == 'T' { /* ... */ }
2. Separation of Concerns
The parser doesn’t need to worry about spaces, comments, or how numbers are formatted. The lexer handles those details.
3. Error Detection
The lexer can catch errors early:
- Invalid characters
- Malformed numbers
- Unterminated strings
Defining Token Types
Let’s define our token types. We’ll use an enum because each token is one of several possible types.
Modify the file src/token.rs to match the following:
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
// literals
Number(i32),
String(String),
Variable(String),
// keywords
Print,
Let,
If,
Then,
Goto,
End,
// operators
Plus,
Minus,
Star,
Slash,
Equal,
Greater,
Less,
}
Let’s break down what each part means:
Derives
#[derive(Debug, Clone, PartialEq)]
- Debug: so we can print tokens with {:?} for debugging
- Clone: so we can copy tokens when needed
- PartialEq: so we can compare tokens with == in tests
Variants with Data
Number(i32),
String(String),
Variable(String),
These tokens carry data:
- Token::Number(42) represents the number 42
- Token::String("Hello") represents the string “Hello”
- Token::Variable("X") represents the variable X
Variants without Data
Print,
Let,
Plus,
These tokens are just markers - the variant itself is the information.
Token Categories
Our tokens fall into natural categories:
Literals - Values that appear in the code
- Number(i32) - numeric literals like 42
- String(String) - text in quotes like "Hello"
- Variable(String) - variable names like X or COUNTER
Keywords - Reserved words with special meaning
Print, Let, If, Then, Goto, End
Operators - Symbols that perform operations
- Plus (+), Minus (-), Star (*), Slash (/)
- Equal (=), Greater (>), Less (<)
Testing Token Creation
Let’s verify our tokens work by adding tests. Append the following to src/token.rs:
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_token_creation() {
let num = Token::Number(42);
let keyword = Token::Print;
let var = Token::Variable("X".to_string());
// These should compile
println!("{:?}", num);
println!("{:?}", keyword);
println!("{:?}", var);
}
#[test]
fn test_token_equality() {
assert_eq!(Token::Print, Token::Print);
assert_eq!(Token::Number(5), Token::Number(5));
assert_ne!(Token::Plus, Token::Minus);
}
#[test]
fn test_token_clone() {
let original = Token::String("Hello".to_string());
let cloned = original.clone();
assert_eq!(original, cloned);
}
}
Run the tests:
cargo test token
The output should contain information about the following tests:
running 3 tests
test token::tests::test_token_clone ... ok
test token::tests::test_token_creation ... ok
test token::tests::test_token_equality ... ok
Peekable Iterators
Before going into the lexer implementation, let’s learn about peekable iterators. With normal iterators, once you consume an element, it’s gone. You can’t look ahead to see the next item without consuming it.
Creating a Peekable Iterator
The .peekable() method converts any iterator into a peekable one:
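For example, here is a small illustration (not taken from our interpreter’s source files):
let mut chars = "123".chars().peekable();
assert_eq!(chars.peek(), Some(&'1')); // look at the next character
assert_eq!(chars.peek(), Some(&'1')); // still there - peek() does not consume
assert_eq!(chars.next(), Some('1'));  // next() consumes it
assert_eq!(chars.peek(), Some(&'2'));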
The peek() method returns Option<&T> - it lets you look at the next element without removing it from the iterator.
Example: Parsing with Lookahead
Peekable iterators are essential for parsing. Here’s a simple expression parser that looks ahead to figure out whether it’s reading a number or an operator:
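A minimal sketch of such a tokenizer might look like this (the function name and the choice to return the tokens as plain strings are just for illustration):
fn tokenize(input: &str) -> Vec<String> {
    let mut chars = input.chars().peekable();
    let mut tokens = Vec::new();
    while let Some(&ch) = chars.peek() {
        if ch.is_ascii_digit() {
            // Keep peeking and consuming while we see digits,
            // collecting them into a single number token.
            let mut num = String::new();
            while let Some(&c) = chars.peek() {
                if c.is_ascii_digit() {
                    num.push(c);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(num);
        } else {
            // Anything else (such as '+') becomes a single-character token.
            tokens.push(ch.to_string());
            chars.next();
        }
    }
    tokens
}
// tokenize("12+34") produces ["12", "+", "34"]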
This tokenizer needs peek() to:
- Check what character comes next without consuming it
- Collect multiple digits into a single number token
- Know when to stop collecting digits
Lexemes are the raw sequences of characters that the lexer identifies in the input text; tokens are the categorized versions of those lexemes. For example, in the line 10 PRINT, the lexemes are the character sequences "10" and "PRINT", and the corresponding tokens are Token::Number(10) and Token::Print.
Building the Lexer
Let’s start building the lexer. The lexer reads characters and produces tokens. The overall idea of lexing a string is shown below in Figure 1.
When given input, we read the input character by character, decide what kind of token is starting, collect characters until the token is complete, create the appropriate token, and repeat until no characters remain.
Lexer Structure
Modify the file src/lexer.rs and add the following content to it.
use crate::token::Token;
pub struct Lexer {
chars: std::iter::Peekable<std::vec::IntoIter<char>>,
}
impl Lexer {
pub fn new(input: &str) -> Self {
let upper = input.to_uppercase();
let chars: Vec<char> = upper.chars().collect();
Self {
chars: chars.into_iter().peekable(),
}
}
}
The above creates a basic structure for our lexer. The lexer will take a string slice as an input, convert it to uppercase (BASIC is case-insensitive), and create a peekable iterator over its characters. The Lexer struct holds this iterator. Don’t worry about the type of chars — it’s a peekable iterator over characters; Rust is what it is!
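As a quick illustration (this snippet is not part of src/lexer.rs), the case conversion means that lowercase input is stored in uppercase form:
let lexer = Lexer::new("print x");
// The lexer now holds a peekable iterator over the characters of "PRINT X",
// because the input is uppercased before the iterator is created.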
Peek-and-Consume Pattern
Implementation-wise, the lexer should (1) peek at the next character, (2) decide what kind of token is starting, (3) collect characters until the token is complete, (4) create the appropriate token, and (5) repeat until no characters remain.
This is the peek-and-consume pattern:
// As long as there are characters to read,
// peek at the next character
while let Some(&ch) = self.chars.peek() {
// The decision-making based on ch goes here
// e.g., match the character to decide what to do
// and, once decided and done, consume the
// character when ready
self.chars.next();
}
The reason for this is that we need to know what’s coming to decide how to handle it before we consume it. For example, if it’s a digit, we need to keep peeking and consuming until we hit a non-digit to form the complete number token. Similarly, if it’s a '"', we need to keep consuming until we find the closing '"' to form the complete string token.
Only after we know what we’re dealing with do we consume characters.
Main Tokenization Loop
Let’s implement the main tokenization method:
// use and other parts as before
impl Lexer {
// new method as before
pub fn tokenize(&mut self) -> Result<Vec<Token>, String> {
let mut tokens = Vec::new();
while let Some(&ch) = self.chars.peek() {
match ch {
' ' | '\t' => {
self.chars.next();
}
'0'..='9' => tokens.push(self.read_number()?),
'"' => tokens.push(self.read_string()?),
'A'..='Z' => tokens.push(self.read_word()),
'+' => {
tokens.push(Token::Plus);
self.chars.next();
}
'-' => {
tokens.push(Token::Minus);
self.chars.next();
}
'*' => {
tokens.push(Token::Star);
self.chars.next();
}
'/' => {
tokens.push(Token::Slash);
self.chars.next();
}
'=' => {
tokens.push(Token::Equal);
self.chars.next();
}
'>' => {
tokens.push(Token::Greater);
self.chars.next();
}
'<' => {
tokens.push(Token::Less);
self.chars.next();
}
_ => {
return Err(format!("Unexpected character: '{}'", ch));
}
}
}
Ok(tokens)
}
// helper methods (read_number, read_string, read_word) go here
}
Above, the tokenize method implements the main loop of the lexer. It repeatedly peeks at the next character and uses pattern matching to decide how to handle it. For simple single-character tokens (like +), it creates the token and consumes the character immediately. For more complex tokens (like numbers, strings, and words), it calls helper methods that handle the details of collecting characters and forming the complete token.
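For instance, a hypothetical caller (not part of src/lexer.rs, and assuming the helper methods defined below) could use tokenize like this:
let mut lexer = Lexer::new("10 PRINT X + 5");
let tokens = lexer.tokenize().unwrap();
assert_eq!(
    tokens,
    vec![
        Token::Number(10),
        Token::Print,
        Token::Variable("X".to_string()),
        Token::Plus,
        Token::Number(5),
    ]
);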
Helper Methods
We’ll implement three helper methods: read_number, read_string, and read_word. Each method will handle the specifics of reading its respective token type.
Reading Numbers
Numbers can have multiple digits. We need to collect all the digits before creating the token. Here, we again use the peek-and-consume pattern: we first create a string to hold the digits, then peek at each character to see if it’s a digit. If it is, we consume it and add it to our string. We stop when we hit a non-digit.
Then, we parse the collected string into a number and create the Token::Number token.
fn read_number(&mut self) -> Result<Token, String> {
let mut num = String::new();
while let Some(&c) = self.chars.peek() {
if c.is_ascii_digit() {
num.push(c);
self.chars.next();
} else {
break;
}
}
num.parse()
.map(Token::Number)
.map_err(|_| format!("Invalid number: {}", num))
}
Reading Strings
Strings start and end with quotes. We need to collect everything between the quotes.
fn read_string(&mut self) -> Result<Token, String> {
self.chars.next(); // skip opening quote
let mut str_val = String::new();
while let Some(c) = self.chars.next() {
if c == '"' {
return Ok(Token::String(str_val));
}
str_val.push(c);
}
Err("Unterminated string".to_string())
}
Lexers in real languages handle escape sequences (like \" for a quote inside a string), but we’ll keep it simple here.
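If you wanted to support them, one possible sketch (not part of our lexer; the escape rule shown here simply takes the character after the backslash literally) would be to replace read_string inside impl Lexer with something like:
fn read_string(&mut self) -> Result<Token, String> {
    self.chars.next(); // skip opening quote
    let mut str_val = String::new();
    while let Some(c) = self.chars.next() {
        match c {
            '"' => return Ok(Token::String(str_val)),
            '\\' => {
                // Take the next character literally, so \" becomes
                // a quote inside the string instead of ending it.
                match self.chars.next() {
                    Some(escaped) => str_val.push(escaped),
                    None => return Err("Unterminated string".to_string()),
                }
            }
            _ => str_val.push(c),
        }
    }
    Err("Unterminated string".to_string())
}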
Reading Words (Keywords and Variables)
Words can be keywords (like PRINT) or variable names (like X). We read the word first, then decide which it is.
fn read_word(&mut self) -> Token {
let mut word = String::new();
while let Some(&c) = self.chars.peek() {
if c.is_alphanumeric() {
word.push(c);
self.chars.next();
} else {
break;
}
}
match word.as_str() {
"PRINT" => Token::Print,
"LET" => Token::Let,
"IF" => Token::If,
"THEN" => Token::Then,
"GOTO" => Token::Goto,
"END" => Token::End,
_ => Token::Variable(word),
}
}
Complete Lexer
Here’s the complete lexer in one place:
use crate::token::Token;
pub struct Lexer {
chars: std::iter::Peekable<std::vec::IntoIter<char>>,
}
impl Lexer {
pub fn new(input: &str) -> Self {
let upper = input.to_uppercase();
let chars: Vec<char> = upper.chars().collect();
Self {
chars: chars.into_iter().peekable(),
}
}
/// Tokenize the entire input
pub fn tokenize(&mut self) -> Result<Vec<Token>, String> {
let mut tokens = Vec::new();
while let Some(&ch) = self.chars.peek() {
match ch {
' ' | '\t' => {
self.chars.next();
}
'0'..='9' => tokens.push(self.read_number()?),
'"' => tokens.push(self.read_string()?),
'A'..='Z' => tokens.push(self.read_word()), // now always uppercase
'+' => {
tokens.push(Token::Plus);
self.chars.next();
}
'-' => {
tokens.push(Token::Minus);
self.chars.next();
}
'*' => {
tokens.push(Token::Star);
self.chars.next();
}
'/' => {
tokens.push(Token::Slash);
self.chars.next();
}
'=' => {
tokens.push(Token::Equal);
self.chars.next();
}
'>' => {
tokens.push(Token::Greater);
self.chars.next();
}
'<' => {
tokens.push(Token::Less);
self.chars.next();
}
_ => {
return Err(format!("Unexpected character: '{}'", ch));
}
}
}
Ok(tokens)
}
fn read_number(&mut self) -> Result<Token, String> {
let mut num = String::new();
while let Some(&c) = self.chars.peek() {
if c.is_ascii_digit() {
num.push(c);
self.chars.next();
} else {
break;
}
}
num.parse()
.map(Token::Number)
.map_err(|_| format!("Invalid number: {}", num))
}
fn read_string(&mut self) -> Result<Token, String> {
self.chars.next(); // skip opening quote
let mut str_val = String::new();
while let Some(c) = self.chars.next() {
if c == '"' {
return Ok(Token::String(str_val));
}
str_val.push(c);
}
Err("Unterminated string".to_string())
}
fn read_word(&mut self) -> Token {
let mut word = String::new();
while let Some(&c) = self.chars.peek() {
if c.is_alphanumeric() {
word.push(c);
self.chars.next();
} else {
break;
}
}
match word.as_str() {
"PRINT" => Token::Print,
"LET" => Token::Let,
"IF" => Token::If,
"THEN" => Token::Then,
"GOTO" => Token::Goto,
"END" => Token::End,
_ => Token::Variable(word),
}
}
}
About 120 lines of code to transform text into tokens. Could be worse!
Note: In a production lexer, you’d want to handle more cases, such as comments, floating-point numbers, and error reporting with line/column numbers. But this is a good start for our BASIC interpreter.
Testing the Lexer
Let’s next add tests to ensure our lexer works correctly. Append the following to src/lexer.rs:
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_empty_input() {
let mut lexer = Lexer::new("");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 0);
}
#[test]
fn test_whitespace_handling() {
let mut lexer = Lexer::new(" 10 20 ");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 2);
assert_eq!(tokens[0], Token::Number(10));
assert_eq!(tokens[1], Token::Number(20));
}
#[test]
fn test_integers() {
let mut lexer = Lexer::new("10 20 30");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Number(10));
assert_eq!(tokens[1], Token::Number(20));
assert_eq!(tokens[2], Token::Number(30));
}
#[test]
fn test_strings() {
let mut lexer = Lexer::new(r#""Hello" "World""#);
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::String("HELLO".to_string()));
assert_eq!(tokens[1], Token::String("WORLD".to_string()));
}
#[test]
fn test_unterminated_string() {
let mut lexer = Lexer::new(r#""Hello"#);
let result = lexer.tokenize();
assert!(result.is_err());
}
#[test]
fn test_keywords() {
let mut lexer = Lexer::new("PRINT LET IF THEN GOTO END");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Print);
assert_eq!(tokens[1], Token::Let);
assert_eq!(tokens[2], Token::If);
assert_eq!(tokens[3], Token::Then);
assert_eq!(tokens[4], Token::Goto);
assert_eq!(tokens[5], Token::End);
}
#[test]
fn test_case_insensitive_keywords() {
let mut lexer = Lexer::new("print PRINT Print pRiNt");
let tokens = lexer.tokenize().unwrap();
// All should be recognized as PRINT
assert!(tokens.iter().all(|t| *t == Token::Print));
}
#[test]
fn test_variables() {
let mut lexer = Lexer::new("X Y COUNTER");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Variable("X".to_string()));
assert_eq!(tokens[1], Token::Variable("Y".to_string()));
assert_eq!(tokens[2], Token::Variable("COUNTER".to_string()));
}
#[test]
fn test_operators() {
let mut lexer = Lexer::new("+ - * / = > <");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Plus);
assert_eq!(tokens[1], Token::Minus);
assert_eq!(tokens[2], Token::Star);
assert_eq!(tokens[3], Token::Slash);
assert_eq!(tokens[4], Token::Equal);
assert_eq!(tokens[5], Token::Greater);
assert_eq!(tokens[6], Token::Less);
}
#[test]
fn test_complex_expression() {
let mut lexer = Lexer::new("10 PRINT X + 5");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 5);
assert_eq!(tokens[0], Token::Number(10));
assert_eq!(tokens[1], Token::Print);
assert_eq!(tokens[2], Token::Variable("X".to_string()));
assert_eq!(tokens[3], Token::Plus);
assert_eq!(tokens[4], Token::Number(5));
}
#[test]
fn test_string_with_spaces() {
let mut lexer = Lexer::new(r#"PRINT "Hello World""#);
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Print);
assert_eq!(tokens[1], Token::String("HELLO WORLD".to_string()));
}
#[test]
fn test_invalid_character() {
let mut lexer = Lexer::new("@");
let result = lexer.tokenize();
assert!(result.is_err());
if let Err(msg) = result {
assert!(msg.contains("Unexpected character"));
} else {
panic!("Expected an error");
}
}
#[test]
fn test_complete_line() {
let mut lexer = Lexer::new(r#"10 LET X = 5"#);
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 5);
assert_eq!(tokens[0], Token::Number(10));
assert_eq!(tokens[1], Token::Let);
assert_eq!(tokens[2], Token::Variable("X".to_string()));
assert_eq!(tokens[3], Token::Equal);
assert_eq!(tokens[4], Token::Number(5));
}
#[test]
fn test_if_statement() {
let mut lexer = Lexer::new("30 IF X > 10 THEN 50");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 7);
assert_eq!(tokens[0], Token::Number(30));
assert_eq!(tokens[1], Token::If);
assert_eq!(tokens[2], Token::Variable("X".to_string()));
assert_eq!(tokens[3], Token::Greater);
assert_eq!(tokens[4], Token::Number(10));
assert_eq!(tokens[5], Token::Then);
assert_eq!(tokens[6], Token::Number(50));
}
}
When you run the tests with cargo test lexer, you should see all tests passing, confirming that our lexer works correctly across a variety of cases.
Trying the Lexer
Finally, let’s modify the main.rs to try our lexer interactively:
use basic_interpreter::lexer::Lexer;
fn main() {
let input = r#"10 PRINT "Hello, World!""#;
println!("Input: {}", input);
println!();
let mut lexer = Lexer::new(input);
match lexer.tokenize() {
Ok(tokens) => {
println!("Tokens:");
for token in tokens {
println!(" {:?}", token);
}
}
Err(e) => {
println!("Error: {}", e);
}
}
}
When you run the program with cargo run, you should see the following output:
Input: 10 PRINT "Hello, World!"
Tokens:
Number(10)
Print
String("HELLO, WORLD!")
Summary
In this part, we looked into building a lexer that transforms raw text into tokens. To summarize:
- Tokens are the fundamental units of a programming language. A lexer converts text into tokens.
- We defined token types using an enum with variants for literals, keywords, and operators.
- We used peekable iterators to look ahead at characters without consuming them.
- The peek-and-consume pattern is essential for lexing multi-character tokens.
- We implemented helper methods to read numbers, strings, and words.
- We wrote tests to ensure our lexer works correctly.
Next, we’ll look into parsing these tokens into an Abstract Syntax Tree (AST), which represents the structure of our programs.