Before diving into tokenization, let’s understand the complete SQL processing pipeline:
Lexical Analysis (Tokenizer): Converts raw SQL text into a stream of tokens
Syntax Analysis (Parser): Builds an Abstract Syntax Tree (AST) from tokens
Semantic Analysis: Validates the AST for type correctness and schema compliance
Query Planning: Optimizes the query and creates an execution plan
Execution: Actually runs the query against the data
Introduction to Tokens
Previously, we defined the SQL layer as having three main components: the Parser, the Planner, and the Executor. Before we can parse SQL, we first need to break a SQL statement into meaningful units. These units are called tokens, and the process is called tokenization.
For example, tokenizing the statement "SELECT * FROM users WHERE id = 1;" produces the discrete tokens SELECT, *, FROM, users, WHERE, id, =, 1, and ;.
Since tokens are part of the SQL layer, we will place the package in the SQL folder. The TokenType enum lists every category of token the tokenizer can emit; here is its operator and punctuation portion:
```csharp
public enum TokenType
{
    // ... keyword and literal tokens (SELECT, FROM, WHERE, ID, INT, ...) elided ...

    // Operators and Punctuation
    L_BRACKET, R_BRACKET, SEMICOLON, COMMA, DOT, ASTERISK,
    PLUS, MINUS, DIVISION,
    GREATER_THAN, LESS_THAN, GREATER_EQUAL_TO, LESS_EQUAL_TO,
    EQUAL, NOT_EQUAL,
    TRUE, FALSE,

    // Special tokens
    ILLEGAL, END
}
```
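Each token pairs one of these types with the literal text it was scanned from. The Token type itself isn't shown in this excerpt; here is a minimal sketch, assuming a plain (type, literal) pair consistent with the `new Token(TokenType.PLUS, str)` calls in the tokenizer code below:

```csharp
// Minimal sketch of the Token type: a TokenType tag plus the raw lexeme.
// Only the (type, literal) constructor is visible in the tokenizer code,
// so the rest of this shape is an assumption.
public class Token
{
    public TokenType Type { get; }
    public string Literal { get; }

    public Token(TokenType type, string literal)
    {
        Type = type;
        Literal = literal;
    }
}
```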
So far, I support only a subset of the SQL tokens, not all of them. Some of these tokens may not be used for a while, but I implemented them now in case I add their usage later.
Tokenizer Architecture
The Tokenizer class is responsible for converting SQL text into tokens.
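Its private state isn't shown in the excerpts below, but from how the methods use it, a minimal skeleton might look like this (the eager-scanning constructor is my assumption; only the four fields are actually visible in the code):

```csharp
using System.Collections.Generic;

public class Tokenizer
{
    // Fields inferred from the methods shown in this post.
    private readonly string sql;                  // raw SQL text being scanned
    private int currentPosition;                  // cursor into sql while scanning
    private readonly List<Token> tokens = new();  // tokens produced by scanning
    private int currentTokenPosition;             // read cursor for GetNextToken()

    public Tokenizer(string sql)
    {
        this.sql = sql;
        // Assumption: scan the whole input up front, dispatching to helpers
        // such as getPunct() below for each character class (keywords,
        // identifiers, numbers, punctuation), and append the results to tokens.
    }
}
```

The getPunct() helper scans operator and punctuation tokens: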
```csharp
private Token getPunct()
{
    int startPos = currentPosition;
    while (currentPosition < sql.Length && CharUtils.IsPunct(sql[currentPosition]))
    {
        currentPosition++;
        // Single-character tokens are complete after one character;
        // stop early so "*(" doesn't get lumped into one token.
        if (currentPosition - startPos == 1 &&
            (sql[startPos] == '+' || sql[startPos] == '-' ||
             sql[startPos] == '*' || sql[startPos] == '/' ||
             sql[startPos] == ',' || sql[startPos] == ';' ||
             sql[startPos] == '(' || sql[startPos] == ')'))
        {
            break;
        }
    }
    string str = sql.Substring(startPos, currentPosition - startPos);
    return str switch
    {
        "+"  => new Token(TokenType.PLUS, str),
        "-"  => new Token(TokenType.MINUS, str),
        "*"  => new Token(TokenType.ASTERISK, str),
        "="  => new Token(TokenType.EQUAL, str),
        ">=" => new Token(TokenType.GREATER_EQUAL_TO, str),
        "<=" => new Token(TokenType.LESS_EQUAL_TO, str),
        "!=" => new Token(TokenType.NOT_EQUAL, str),
        "<>" => new Token(TokenType.NOT_EQUAL, str),
        // ... more operators
        _    => new Token(TokenType.ILLEGAL, str),
    };
}
```
This method handles:
Single-character operators: +, -, *, =
Multi-character operators: >=, <=, !=, <>
Punctuation: (, ), ,, ;
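The scan loop delegates character classification to CharUtils.IsPunct, which is not shown above. A plausible sketch, assuming it only needs to recognize the operator and punctuation characters the switch handles:

```csharp
// Hedged sketch of the helper getPunct() relies on; the real implementation
// may differ, but it must accept every character that can start or extend
// an operator or punctuation token.
public static class CharUtils
{
    public static bool IsPunct(char c) =>
        c is '+' or '-' or '*' or '/' or ',' or ';' or '.'
          or '(' or ')' or '=' or '<' or '>' or '!';
}
```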
Token Retrieval
The GetNextToken() method provides sequential access to tokens:
```csharp
public Token? GetNextToken()
{
    if (currentTokenPosition >= tokens.Count)
    {
        return new Token(TokenType.END, "");
    }
    return tokens[currentTokenPosition++];
}
```
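Note that exhaustion is signalled with an END sentinel token rather than null, so a parser can treat end-of-input like any other token instead of null-checking every lookahead.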
Example Usage
Here’s how the tokenizer processes a simple SQL statement:
Input: "SELECT name FROM users WHERE id = 1"
Output tokens:
SELECT → TokenType.SELECT
name → TokenType.ID
FROM → TokenType.FROM
users → TokenType.ID
WHERE → TokenType.WHERE
id → TokenType.ID
= → TokenType.EQUAL
1 → TokenType.INT
; → TokenType.SEMICOLON
(end of input) → TokenType.END
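To reproduce this listing, a caller can drain the tokenizer in a loop until the END sentinel appears. A usage sketch, assuming the eager-scanning constructor from the skeleton above:

```csharp
var tokenizer = new Tokenizer("SELECT name FROM users WHERE id = 1;");

// GetNextToken() returns an END sentinel once the stream is exhausted,
// so the loop stops on that rather than on null.
for (Token? token = tokenizer.GetNextToken();
     token != null && token.Type != TokenType.END;
     token = tokenizer.GetNextToken())
{
    Console.WriteLine($"{token.Literal} -> TokenType.{token.Type}");
}
```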
What’s Next?
In the next part of this series, we’ll explore how these tokens are consumed by the parser to build an Abstract Syntax Tree (AST), transforming the flat token stream into a hierarchical representation of the SQL query’s structure.
The tokenizer serves as the foundation for all subsequent processing stages, making its correctness and efficiency crucial for the entire database engine’s performance.