Tokens and Lexing
Learning Objectives
- You understand what tokens are and why we need them.
- You can define token types using enums with data.
- You know how to use peekable iterators for lookahead parsing.
- You can build a lexer that transforms text into tokens.
- You understand how to handle different character types (digits, letters, operators).
From Text to Tokens
When you read a sentence, you don’t process individual letters - you recognize words. Similarly, when processing code, we don’t work with individual characters - we recognize tokens.
Learning involves chunking information into meaningful units. Instead of seeing isolated pieces such as letters or characters, experts see larger patterns like words or tokens. Perhaps the most famous example of such expertise comes from chess, where expert players see configurations of pieces rather than individual pieces.
A token is the smallest meaningful unit in a programming language. Just as “the,” “cat,” and “sat” are tokens in English, PRINT, 42, and + are tokens in BASIC.
Consider this BASIC line:
10 PRINT "Hello!"
A human sees the following pieces (the spaces are ignored):
- 10 - a line number
- PRINT - a keyword
- "Hello!" - a string literal
A lexer would transform the text into these tokens:
[
Token::Number(10),
Token::Print,
Token::String("Hello!".to_string()),
]
Tokens capture both what kind of thing they are (number, keyword, string) and what value they contain (if applicable).
Why Not Work with Characters Directly?
Three reasons:
1. Abstraction
Working with tokens is easier than working with characters. Compare:
// With tokens:
if token == Token::Print { /* handle print */ }
// With characters:
if chars[0] == 'P' && chars[1] == 'R' && chars[2] == 'I'
&& chars[3] == 'N' && chars[4] == 'T' { /* ... */ }
2. Separation of Concerns
The parser doesn’t need to worry about spaces, comments, or how numbers are formatted. The lexer handles those details.
3. Error Detection
The lexer can catch errors early:
- Invalid characters
- Malformed numbers
- Unterminated strings
Defining Token Types
Let’s define our token types. We’ll use an enum because each token is one of several possible types.
Modify the file src/token.rs to match the following:
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
// literals
Number(i32),
String(String),
Variable(String),
// keywords
Print,
Let,
If,
Then,
Goto,
End,
// operators
Plus,
Minus,
Star,
Slash,
Equal,
Greater,
Less,
}
Let’s break down what each part means:
Derives
#[derive(Debug, Clone, PartialEq)]
- Debug: so we can print tokens with {:?} for debugging
- Clone: so we can copy tokens when needed
- PartialEq: so we can compare tokens with == in tests
Variants with Data
Number(i32),
String(String),
Variable(String),
These tokens carry data:
- Token::Number(42) represents the number 42
- Token::String("Hello") represents the string “Hello”
- Token::Variable("X") represents the variable X
Variants without Data
Print,
Let,
Plus,
These tokens are just markers - the variant itself is the information.
Token Categories
Our tokens fall into natural categories:
Literals - Values that appear in the code
- Number(i32) - numeric literals like 42
- String(String) - text in quotes like "Hello"
- Variable(String) - variable names like X or COUNTER
Keywords - Reserved words with special meaning
Print, Let, If, Then, Goto, End
Operators - Symbols that perform operations
- Plus (+), Minus (-), Star (*), Slash (/)
- Equal (=), Greater (>), Less (<)
Testing Token Creation
Let’s verify our tokens work by adding tests. Append the following to src/token.rs:
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_token_creation() {
let num = Token::Number(42);
let keyword = Token::Print;
let var = Token::Variable("X".to_string());
// These should compile
println!("{:?}", num);
println!("{:?}", keyword);
println!("{:?}", var);
}
#[test]
fn test_token_equality() {
assert_eq!(Token::Print, Token::Print);
assert_eq!(Token::Number(5), Token::Number(5));
assert_ne!(Token::Plus, Token::Minus);
}
#[test]
fn test_token_clone() {
let original = Token::String("Hello".to_string());
let cloned = original.clone();
assert_eq!(original, cloned);
}
}
Run the tests:
cargo test token
The output should contain information about the following tests:
running 3 tests
test token::tests::test_token_clone ... ok
test token::tests::test_token_creation ... ok
test token::tests::test_token_equality ... ok
Peekable Iterators
Before going into the lexer implementation, let’s learn about peekable iterators. With normal iterators, once you consume an element, it’s gone. You can’t look ahead to see the next item without consuming it.
Creating a Peekable Iterator
The .peekable() method converts any iterator into a peekable one:
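For example, here is a small illustration (not taken from our interpreter’s source files):
let mut chars = "123".chars().peekable();
assert_eq!(chars.peek(), Some(&'1')); // look at the next character
assert_eq!(chars.peek(), Some(&'1')); // still there - peek() does not consume
assert_eq!(chars.next(), Some('1'));  // next() consumes it
assert_eq!(chars.peek(), Some(&'2'));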
The peek() method returns Option<&T> - it lets you look at the next element without removing it from the iterator.
Example: Parsing with Lookahead
Peekable iterators are essential for parsing. Here’s a simple expression parser that looks ahead to figure out whether it’s reading a number or an operator:
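A minimal sketch of such a tokenizer might look like this (the function name and the choice to return the tokens as plain strings are just for illustration):
fn tokenize(input: &str) -> Vec<String> {
    let mut chars = input.chars().peekable();
    let mut tokens = Vec::new();
    while let Some(&ch) = chars.peek() {
        if ch.is_ascii_digit() {
            // Keep peeking and consuming while we see digits,
            // collecting them into a single number token.
            let mut num = String::new();
            while let Some(&c) = chars.peek() {
                if c.is_ascii_digit() {
                    num.push(c);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(num);
        } else {
            // Anything else (such as '+') becomes a single-character token.
            tokens.push(ch.to_string());
            chars.next();
        }
    }
    tokens
}
// tokenize("12+34") produces ["12", "+", "34"]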
This tokenizer needs peek() to:
- Check what character comes next without consuming it
- Collect multiple digits into a single number token
- Know when to stop collecting digits
Lexemes are the raw sequences of characters that the lexer identifies in the input text; tokens are the categorized versions of those lexemes. For example, in the line 10 PRINT, the lexemes are the character sequences "10" and "PRINT", and the corresponding tokens are Token::Number(10) and Token::Print.
Building the Lexer
Let’s start building the lexer. The lexer reads characters and produces tokens. The overall idea of lexing a string is shown below in Figure 1.
When given input, we read the input character by character, decide what kind of token is starting, collect characters until the token is complete, create the appropriate token, and repeat until no characters remain.
Lexer Structure
Modify the file src/lexer.rs and add the following content to it.
use crate::token::Token;
pub struct Lexer {
chars: std::iter::Peekable<std::vec::IntoIter<char>>,
}
impl Lexer {
pub fn new(input: &str) -> Self {
let upper = input.to_uppercase();
let chars: Vec<char> = upper.chars().collect();
Self {
chars: chars.into_iter().peekable(),
}
}
}
The above creates a basic structure for our lexer. The lexer will take a string slice as an input, convert it to uppercase (BASIC is case-insensitive), and create a peekable iterator over its characters. The Lexer struct holds this iterator. Don’t worry about the type of chars — it’s a peekable iterator over characters; Rust is what it is!
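As a quick illustration (this snippet is not part of src/lexer.rs), the case conversion means that lowercase input is stored in uppercase form:
let lexer = Lexer::new("print x");
// The lexer now holds a peekable iterator over the characters of "PRINT X",
// because the input is uppercased before the iterator is created.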
Peek-and-Consume Pattern
Implementation-wise, the lexer should (1) peek at the next character, (2) decide what kind of token is starting, (3) collect characters until the token is complete, (4) create the appropriate token, and (5) repeat until no characters remain.
This is the peek-and-consume pattern:
// As long as there are characters to read,
// peek at the next character
while let Some(&ch) = self.chars.peek() {
// The decision-making based on ch goes here
// e.g., match the character to decide what to do
// and, once decided and done, consume the
// character when ready
self.chars.next();
}
The reason for this is that we need to know what’s coming to decide how to handle it before we consume it. For example, if it’s a digit, we need to keep peeking and consuming until we hit a non-digit to form the complete number token. Similarly, if it’s a '"', we need to keep consuming until we find the closing '"' to form the complete string token.
Only after we know what we’re dealing with do we consume characters.
Main Tokenization Loop
Let’s implement the main tokenization method:
// use and other parts as before
impl Lexer {
// new method as before
pub fn tokenize(&mut self) -> Result<Vec<Token>, String> {
let mut tokens = Vec::new();
while let Some(&ch) = self.chars.peek() {
match ch {
' ' | '\t' => {
self.chars.next();
}
'0'..='9' => tokens.push(self.read_number()?),
'"' => tokens.push(self.read_string()?),
'A'..='Z' => tokens.push(self.read_word()),
'+' => {
tokens.push(Token::Plus);
self.chars.next();
}
'-' => {
tokens.push(Token::Minus);
self.chars.next();
}
'*' => {
tokens.push(Token::Star);
self.chars.next();
}
'/' => {
tokens.push(Token::Slash);
self.chars.next();
}
'=' => {
tokens.push(Token::Equal);
self.chars.next();
}
'>' => {
tokens.push(Token::Greater);
self.chars.next();
}
'<' => {
tokens.push(Token::Less);
self.chars.next();
}
_ => {
return Err(format!("Unexpected character: '{}'", ch));
}
}
}
Ok(tokens)
}
// helper methods (read_number, read_string, read_word) go here
}
Above, the tokenize method implements the main loop of the lexer. It repeatedly peeks at the next character and uses pattern matching to decide how to handle it. For simple single-character tokens (like +), it creates the token and consumes the character immediately. For more complex tokens (like numbers, strings, and words), it calls helper methods that handle the details of collecting characters and forming the complete token.
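For instance, a hypothetical caller (not part of src/lexer.rs, and assuming the helper methods defined below) could use tokenize like this:
let mut lexer = Lexer::new("10 PRINT X + 5");
let tokens = lexer.tokenize().unwrap();
assert_eq!(
    tokens,
    vec![
        Token::Number(10),
        Token::Print,
        Token::Variable("X".to_string()),
        Token::Plus,
        Token::Number(5),
    ]
);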
Helper Methods
We’ll implement three helper methods: read_number, read_string, and read_word. Each method will handle the specifics of reading its respective token type.
Reading Numbers
Numbers can have multiple digits. We need to collect all the digits before creating the token. Here, we again use the peek-and-consume pattern: we first create a string to hold the digits, then peek at each character to see if it’s a digit. If it is, we consume it and add it to our string. We stop when we hit a non-digit.
Then, we parse the collected string into a number and create the Token::Number token.
fn read_number(&mut self) -> Result<Token, String> {
let mut num = String::new();
while let Some(&c) = self.chars.peek() {
if c.is_ascii_digit() {
num.push(c);
self.chars.next();
} else {
break;
}
}
num.parse()
.map(Token::Number)
.map_err(|_| format!("Invalid number: {}", num))
}
Reading Strings
Strings start and end with quotes. We need to collect everything between the quotes.
fn read_string(&mut self) -> Result<Token, String> {
self.chars.next(); // skip opening quote
let mut str_val = String::new();
while let Some(c) = self.chars.next() {
if c == '"' {
return Ok(Token::String(str_val));
}
str_val.push(c);
}
Err("Unterminated string".to_string())
}
Lexers in real languages handle escape sequences (like \" for a quote inside a string), but we’ll keep it simple here.
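If you wanted to support them, one possible sketch (not part of our lexer; the escape rule shown here simply takes the character after the backslash literally) would be to replace read_string inside impl Lexer with something like:
fn read_string(&mut self) -> Result<Token, String> {
    self.chars.next(); // skip opening quote
    let mut str_val = String::new();
    while let Some(c) = self.chars.next() {
        match c {
            '"' => return Ok(Token::String(str_val)),
            '\\' => {
                // Take the next character literally, so \" becomes
                // a quote inside the string instead of ending it.
                match self.chars.next() {
                    Some(escaped) => str_val.push(escaped),
                    None => return Err("Unterminated string".to_string()),
                }
            }
            _ => str_val.push(c),
        }
    }
    Err("Unterminated string".to_string())
}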
Reading Words (Keywords and Variables)
Words can be keywords (like PRINT) or variable names (like X). We read the word first, then decide which it is.
fn read_word(&mut self) -> Token {
let mut word = String::new();
while let Some(&c) = self.chars.peek() {
if c.is_alphanumeric() {
word.push(c);
self.chars.next();
} else {
break;
}
}
match word.as_str() {
"PRINT" => Token::Print,
"LET" => Token::Let,
"IF" => Token::If,
"THEN" => Token::Then,
"GOTO" => Token::Goto,
"END" => Token::End,
_ => Token::Variable(word),
}
}
Complete Lexer
Here’s the complete lexer in one place:
use crate::token::Token;
pub struct Lexer {
chars: std::iter::Peekable<std::vec::IntoIter<char>>,
}
impl Lexer {
pub fn new(input: &str) -> Self {
let upper = input.to_uppercase();
let chars: Vec<char> = upper.chars().collect();
Self {
chars: chars.into_iter().peekable(),
}
}
/// Tokenize the entire input
pub fn tokenize(&mut self) -> Result<Vec<Token>, String> {
let mut tokens = Vec::new();
while let Some(&ch) = self.chars.peek() {
match ch {
' ' | '\t' => {
self.chars.next();
}
'0'..='9' => tokens.push(self.read_number()?),
'"' => tokens.push(self.read_string()?),
'A'..='Z' => tokens.push(self.read_word()), // now always uppercase
'+' => {
tokens.push(Token::Plus);
self.chars.next();
}
'-' => {
tokens.push(Token::Minus);
self.chars.next();
}
'*' => {
tokens.push(Token::Star);
self.chars.next();
}
'/' => {
tokens.push(Token::Slash);
self.chars.next();
}
'=' => {
tokens.push(Token::Equal);
self.chars.next();
}
'>' => {
tokens.push(Token::Greater);
self.chars.next();
}
'<' => {
tokens.push(Token::Less);
self.chars.next();
}
_ => {
return Err(format!("Unexpected character: '{}'", ch));
}
}
}
Ok(tokens)
}
fn read_number(&mut self) -> Result<Token, String> {
let mut num = String::new();
while let Some(&c) = self.chars.peek() {
if c.is_ascii_digit() {
num.push(c);
self.chars.next();
} else {
break;
}
}
num.parse()
.map(Token::Number)
.map_err(|_| format!("Invalid number: {}", num))
}
fn read_string(&mut self) -> Result<Token, String> {
self.chars.next(); // skip opening quote
let mut str_val = String::new();
while let Some(c) = self.chars.next() {
if c == '"' {
return Ok(Token::String(str_val));
}
str_val.push(c);
}
Err("Unterminated string".to_string())
}
fn read_word(&mut self) -> Token {
let mut word = String::new();
while let Some(&c) = self.chars.peek() {
if c.is_alphanumeric() {
word.push(c);
self.chars.next();
} else {
break;
}
}
match word.as_str() {
"PRINT" => Token::Print,
"LET" => Token::Let,
"IF" => Token::If,
"THEN" => Token::Then,
"GOTO" => Token::Goto,
"END" => Token::End,
_ => Token::Variable(word),
}
}
}
About 120 lines of code to transform text into tokens. Could be worse!
Note: In a production lexer, you’d want to handle more cases, such as comments, floating-point numbers, and error reporting with line/column numbers. But this is a good start for our BASIC interpreter.
Testing the Lexer
Let’s next add tests to ensure our lexer works correctly. Append the following to src/lexer.rs:
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_empty_input() {
let mut lexer = Lexer::new("");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 0);
}
#[test]
fn test_whitespace_handling() {
let mut lexer = Lexer::new(" 10 20 ");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 2);
assert_eq!(tokens[0], Token::Number(10));
assert_eq!(tokens[1], Token::Number(20));
}
#[test]
fn test_integers() {
let mut lexer = Lexer::new("10 20 30");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Number(10));
assert_eq!(tokens[1], Token::Number(20));
assert_eq!(tokens[2], Token::Number(30));
}
#[test]
fn test_strings() {
let mut lexer = Lexer::new(r#""Hello" "World""#);
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::String("HELLO".to_string()));
assert_eq!(tokens[1], Token::String("WORLD".to_string()));
}
#[test]
fn test_unterminated_string() {
let mut lexer = Lexer::new(r#""Hello"#);
let result = lexer.tokenize();
assert!(result.is_err());
}
#[test]
fn test_keywords() {
let mut lexer = Lexer::new("PRINT LET IF THEN GOTO END");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Print);
assert_eq!(tokens[1], Token::Let);
assert_eq!(tokens[2], Token::If);
assert_eq!(tokens[3], Token::Then);
assert_eq!(tokens[4], Token::Goto);
assert_eq!(tokens[5], Token::End);
}
#[test]
fn test_case_insensitive_keywords() {
let mut lexer = Lexer::new("print PRINT Print pRiNt");
let tokens = lexer.tokenize().unwrap();
// All should be recognized as PRINT
assert!(tokens.iter().all(|t| *t == Token::Print));
}
#[test]
fn test_variables() {
let mut lexer = Lexer::new("X Y COUNTER");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Variable("X".to_string()));
assert_eq!(tokens[1], Token::Variable("Y".to_string()));
assert_eq!(tokens[2], Token::Variable("COUNTER".to_string()));
}
#[test]
fn test_operators() {
let mut lexer = Lexer::new("+ - * / = > <");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Plus);
assert_eq!(tokens[1], Token::Minus);
assert_eq!(tokens[2], Token::Star);
assert_eq!(tokens[3], Token::Slash);
assert_eq!(tokens[4], Token::Equal);
assert_eq!(tokens[5], Token::Greater);
assert_eq!(tokens[6], Token::Less);
}
#[test]
fn test_complex_expression() {
let mut lexer = Lexer::new("10 PRINT X + 5");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 5);
assert_eq!(tokens[0], Token::Number(10));
assert_eq!(tokens[1], Token::Print);
assert_eq!(tokens[2], Token::Variable("X".to_string()));
assert_eq!(tokens[3], Token::Plus);
assert_eq!(tokens[4], Token::Number(5));
}
#[test]
fn test_string_with_spaces() {
let mut lexer = Lexer::new(r#"PRINT "Hello World""#);
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens[0], Token::Print);
assert_eq!(tokens[1], Token::String("HELLO WORLD".to_string()));
}
#[test]
fn test_invalid_character() {
let mut lexer = Lexer::new("@");
let result = lexer.tokenize();
assert!(result.is_err());
if let Err(msg) = result {
assert!(msg.contains("Unexpected character"));
} else {
panic!("Expected an error");
}
}
#[test]
fn test_complete_line() {
let mut lexer = Lexer::new(r#"10 LET X = 5"#);
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 5);
assert_eq!(tokens[0], Token::Number(10));
assert_eq!(tokens[1], Token::Let);
assert_eq!(tokens[2], Token::Variable("X".to_string()));
assert_eq!(tokens[3], Token::Equal);
assert_eq!(tokens[4], Token::Number(5));
}
#[test]
fn test_if_statement() {
let mut lexer = Lexer::new("30 IF X > 10 THEN 50");
let tokens = lexer.tokenize().unwrap();
assert_eq!(tokens.len(), 7);
assert_eq!(tokens[0], Token::Number(30));
assert_eq!(tokens[1], Token::If);
assert_eq!(tokens[2], Token::Variable("X".to_string()));
assert_eq!(tokens[3], Token::Greater);
assert_eq!(tokens[4], Token::Number(10));
assert_eq!(tokens[5], Token::Then);
assert_eq!(tokens[6], Token::Number(50));
}
}
When you run the tests with cargo test lexer, you should see all tests passing, confirming that our lexer works correctly across a variety of cases.
Trying the Lexer
Finally, let’s modify the main.rs to try our lexer interactively:
use basic_interpreter::lexer::Lexer;
fn main() {
let input = r#"10 PRINT "Hello, World!""#;
println!("Input: {}", input);
println!();
let mut lexer = Lexer::new(input);
match lexer.tokenize() {
Ok(tokens) => {
println!("Tokens:");
for token in tokens {
println!(" {:?}", token);
}
}
Err(e) => {
println!("Error: {}", e);
}
}
}
When you run the program with cargo run, you should see the following output:
Input: 10 PRINT "Hello, World!"
Tokens:
Number(10)
Print
String("HELLO, WORLD!")
Summary
In this part, we looked into building a lexer that transforms raw text into tokens. To summarize:
- Tokens are the fundamental units of a programming language. A lexer converts text into tokens.
- We defined token types using an enum with variants for literals, keywords, and operators.
- We used peekable iterators to look ahead at characters without consuming them.
- The peek-and-consume pattern is essential for lexing multi-character tokens.
- We implemented helper methods to read numbers, strings, and words.
- We wrote tests to ensure our lexer works correctly.
Next, we’ll look into parsing these tokens into an Abstract Syntax Tree (AST), which represents the structure of our programs.