Welcome back to our database engine series! In the previous articles, we built a solid foundation: a lexer and parser that turn raw SQL text into an abstract syntax tree (AST).
Now it’s time to add intelligence to our database engine. Having an AST is great, but we need to ensure it makes sense semantically. Can we actually execute this query? Do the referenced tables and columns exist? Are the types compatible?
In this article, we’ll build several critical components organized into a clean architecture:

- `SemanticAnalyzer.cs` - The main validator that ensures SQL queries are meaningful
- `ExpressionType.cs` - Type system definitions for scalars, row sets, and unknowns
- `ExpressionTypeInferrer.cs` - The smart engine that automatically determines expression types

In earlier iterations, semantic analysis was handled by a single monolithic file. As our database engine grew more sophisticated, we refactored it into a clean package structure:
```
Sql/Semantic/
├── SemanticAnalyzer.cs
├── ExpressionType.cs
└── ExpressionTypeInferrer.cs
```
Why the refactor? Separation of concerns: validation logic, type definitions, and type inference each live in their own file, so each piece can be understood, tested, and extended independently. This modular approach makes our semantic analysis system both powerful and maintainable.
By the end of this article, our database engine will be able to catch errors like:

- `SELECT name FROM nonexistent_table` ❌ (the table doesn’t exist)
- `SELECT age + "hello"` ❌ (can’t add number and string)
- `SELECT COUNT(name, age)` ❌ (COUNT takes only one argument)

Let’s explore each component!
Think of the Catalog Manager as the database’s “memory system.” Just like how we remember what friends we have and their phone numbers, our database needs to remember what tables exist, what columns they have, and what functions are available.
Consider this simple query:
```sql
SELECT name, age FROM users WHERE age > 18;
```
To validate this query, we need to answer:

- Does the `users` table exist?
- Does `users` have `name` and `age` columns?
- What type is the `age` column? (to validate `age > 18`)

Without a catalog, our database would be like a librarian with amnesia - unable to help anyone find anything!
Let’s start with the interface that defines what our catalog should do:
```csharp
public interface ICatalogManager {
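    // The members below are a sketch: method names are assumptions based on
    // how the catalog is used later in this article, not a definitive API.
    bool TableExists(string tableName);
    TableSchema GetTableSchema(string tableName);
    void CreateTable(string tableName, List<ColumnDefinition> columns);
    void DropTable(string tableName);
    bool FunctionExists(string functionName);
    FunctionDefinition GetFunction(string functionName);
}
```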
Now let’s implement it:
```csharp
public class CatalogManager : ICatalogManager {
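    // Sketch of likely internals (field names and case-insensitive
    // comparison are assumptions): schemas and functions keyed by name.
    private readonly Dictionary<string, TableSchema> _tables =
        new(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<string, FunctionDefinition> _functions =
        new(StringComparer.OrdinalIgnoreCase);

    public CatalogManager() => RegisterBuiltinFunctions();

    public bool TableExists(string tableName) => _tables.ContainsKey(tableName);

    public TableSchema GetTableSchema(string tableName) => _tables[tableName];

    // CreateTable and RegisterBuiltinFunctions are shown below.
}
```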
Key Design Decisions:

- Everything sits behind the `ICatalogManager` interface, so the semantic analyzer never depends on a concrete catalog implementation
- Tables and functions are looked up by name, which keeps validation fast and simple
```csharp
public void CreateTable(string tableName, List<ColumnDefinition> columns) {
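    // Sketch of the duplicate-name checks discussed in this article
    // (the exact error-reporting style is an assumption).
    if (_tables.ContainsKey(tableName))
        throw new InvalidOperationException($"Table '{tableName}' already exists.");

    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (var column in columns)
        if (!seen.Add(column.Name))
            throw new InvalidOperationException($"Duplicate column '{column.Name}'.");

    _tables[tableName] = new TableSchema(tableName, columns);
}
```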
```csharp
public class TableSchema {
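    // Sketch: a schema is just the table name plus its column definitions.
    public string Name { get; }
    public List<ColumnDefinition> Columns { get; }

    public TableSchema(string name, List<ColumnDefinition> columns) {
        Name = name;
        Columns = columns;
    }
}

// Assumed shape of ColumnDefinition (the exact class may differ):
public class ColumnDefinition {
    public string Name { get; set; }
    public string Type { get; set; }     // e.g. "INT", "TEXT", "BOOLEAN"
    public bool IsNotNull { get; set; }
}
```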
Our database needs to know about SQL functions like `COUNT`, `SUM`, `AVG`, etc. Let’s build a flexible function registry:
```csharp
public class FunctionParameter {
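    // Sketch: a parameter knows which types it accepts and whether it is
    // optional (property names are assumptions).
    public string Name { get; set; }
    public List<string> AcceptedTypes { get; set; }  // e.g. ["INT", "FLOAT"]
    public bool IsOptional { get; set; }
}
```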
```csharp
public class FunctionDefinition {
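    // Sketch: a function is a name, a parameter list, and a return type.
    public string Name { get; set; }
    public List<FunctionParameter> Parameters { get; set; }
    public string ReturnType { get; set; }
}
```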
```csharp
private void RegisterBuiltinFunctions() {
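    // Sketch: register a couple of builtins; the real list is longer
    // (SUM, AVG, MIN, MAX, ABS, COALESCE, ...).
    _functions["COUNT"] = new FunctionDefinition {
        Name = "COUNT",
        Parameters = new List<FunctionParameter> {
            new() { Name = "value", AcceptedTypes = new List<string> { "ANY" } }
        },
        ReturnType = "INT"  // COUNT always yields an integer
    };
    _functions["UPPER"] = new FunctionDefinition {
        Name = "UPPER",
        Parameters = new List<FunctionParameter> {
            new() { Name = "text", AcceptedTypes = new List<string> { "STRING" } }
        },
        ReturnType = "STRING"
    };
}
```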
Before we can validate queries, we need a way to represent and work with different types of expressions. SQL has some interesting type complexities:

- Scalar values like `42`, `"hello"`, or `age + 5` produce a single value
- Row sets like `SELECT name FROM users` (returns multiple rows)

Our three-tier type system captures all of these:

```csharp
public enum TypeKind { Scalar, RowSet, Unknown }
```
Examples:

- `42` → `Scalar(Int, false)`
- `"hello"` → `Scalar(String, false)`
- `SELECT name FROM users` → `RowSet([Scalar(String, true)])`
- `NULL` → `Unknown()` (resolved later from context)
This type system is defined in `ExpressionType.cs` and forms the foundation for our type inference engine.
Now that we have our catalog to store schema information and our type system to represent expression types, let’s explore the semantic analysis package. This is where the magic happens - we traverse our AST and validate that everything makes semantic sense.
Our semantic analysis is now cleanly organized into three specialized files:
`SemanticAnalyzer.cs` - The Main Validator
The core validation engine that implements the Visitor pattern to traverse AST nodes and validate statements, clauses, and expressions against the catalog.

`ExpressionType.cs` - The Type Foundation
Defines our three-tier type system with factory methods for easy type creation:

- `Scalar` types for single values
- `RowSet` types for subquery results
- `Unknown` types for inference failures

`ExpressionTypeInferrer.cs` - The Smart Detective
The automatic type inference engine that determines result types for literals, column references, operators, function calls, and subqueries.
Our semantic analyzer in `SemanticAnalyzer.cs` implements the visitor pattern to traverse AST nodes:
```csharp
class SemanticAnalyzer : IVisitor {
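    // Sketch of the skeleton (field names other than _currentTableSchemas
    // are assumptions).
    private readonly ICatalogManager _catalog;
    private readonly ExpressionTypeInferrer _typeInferrer;

    // Table name/alias -> schema; rebuilt for every statement we validate.
    private readonly Dictionary<string, TableSchema> _currentTableSchemas = new();

    // Errors are collected so a single pass can report every problem.
    private readonly List<string> _errors = new();

    public void Visit(SelectStatement stmt) { /* shown later in this article */ }
    public void Visit(InsertStatement stmt) { /* ... */ }
    // ...one Visit method per AST node type
}
```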
Key Design Points:

- The analyzer keeps state (like `_currentTableSchemas`) as it walks the tree, so later phases can use context built up earlier
- Each AST node type gets its own `Visit` method, keeping validation rules local to the construct they belong to
Let’s start with the simpler statement types - each has its own specific validation challenges:
CREATE TABLE statements require validating the rationality of table structure definitions:
Table Existence Check
First, check whether the table name to be created already exists. Databases don’t allow creating tables with the same name, as this would cause schema conflicts.
Column Definition Validation
Column names within a table must be unique - a table can’t define two `name` columns simultaneously.

Constraint Logic Consistency
The constraints declared on each column must make sense together.
DROP TABLE is relatively simple but still requires key checks:
Table Existence Verification
For each table name listed in the DROP statement, the validator ensures that the table actually exists in the catalog. Attempting to delete non-existent tables produces errors.
Batch Operation Handling
Supports batch deletion like `DROP TABLE users, products, orders`, with the validator checking each table name’s existence individually.
INSERT statement validation is the most complex because it involves type matching:
Target Table Validation
Ensure the table for data insertion exists in the database.
Column-Value Matching Strategy
The validator handles two INSERT syntaxes:

- Explicit columns: `INSERT INTO users (name, age) VALUES (...)`
- Implicit columns: `INSERT INTO users VALUES (...)` (following table definition column order)

Quantity Consistency Check
The number of values in each row must exactly match the specified number of columns. `INSERT INTO users (name, age) VALUES ('John', 25, 'extra')` will be rejected.
Type Compatibility Validation
This is the most critical part: each value’s type must be compatible with its target column’s type - for example, a string can’t be inserted into a boolean column.
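Putting those INSERT rules together, here’s a minimal sketch of the visitor method (AST property names like `ValueRows` are assumptions):

```csharp
public void Visit(InsertStatement stmt) {
    // Target table must exist.
    if (!_catalog.TableExists(stmt.TableName)) {
        _errors.Add($"Table '{stmt.TableName}' does not exist.");
        return;
    }
    var schema = _catalog.GetTableSchema(stmt.TableName);

    // Explicit column list, or the table's declared column order.
    var targetColumns = stmt.Columns.Count > 0
        ? stmt.Columns
        : schema.Columns.Select(c => c.Name).ToList();

    foreach (var row in stmt.ValueRows) {
        // Quantity consistency: one value per target column, exactly.
        if (row.Count != targetColumns.Count) {
            _errors.Add($"Expected {targetColumns.Count} values but got {row.Count}.");
            continue;
        }
        // Type compatibility: each value must fit its column's declared type,
        // with the value's type supplied by the ExpressionTypeInferrer.
    }
}
```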
UPDATE statements require multi-layer validation:
Target Table Existence
Verify that the table being updated actually exists.
SET Clause Validation
For each `column = value` pair:

- The column must exist in the target table
- The value’s type must be compatible with the column’s type
WHERE Clause Handling
If a WHERE clause exists, the validator must check all of its expressions and ensure the clause evaluates to a boolean value.

Update Constraint Check
Updated values must also respect the constraints declared on their columns.
DELETE statement validation focuses on:
Target Table Validation
Ensure the table from which data is to be deleted exists.
WHERE Clause Criticality
Any WHERE clause must have a boolean return type.

Referential Integrity Considerations
Although our system hasn’t implemented foreign keys yet, the DELETE validator reserves extension points for future referential integrity checks.
SELECT statements are where the Visitor pattern really shows its power. Here’s how our `SemanticAnalyzer` tackles this multi-layered validation challenge:
The Visitor’s Journey Through SELECT
When our visitor encounters a SELECT node, it follows a carefully orchestrated five-phase approach:
Phase 1: Building Table Context
The visitor first clears its current table context and processes the FROM clause. For each table or alias mentioned, it validates that:

- The table exists in the catalog
- No table name or alias appears twice (e.g., `FROM users u, products u` is rejected)

This phase builds up the `_currentTableSchemas` dictionary that becomes our reference for all subsequent column validation.
Phase 2: SELECT List Validation
Here’s where things get interesting. The visitor handles three scenarios:

- Star expressions (`*`): Always valid if we have tables in FROM
- Qualified stars (`table.*`): Must verify the table exists in our FROM clause
- Ordinary expressions: Validated recursively like any other expression

Phase 3: WHERE Clause Logic
If a WHERE clause exists, the visitor recursively validates all its expressions, then applies a critical constraint: the final result must be a boolean expression. You can’t have `WHERE name` (string) or `WHERE age` (int) - it must be something that evaluates to true/false.
Phase 4 & 5: Column Reference Validation
For GROUP BY and ORDER BY clauses, the visitor ensures every referenced column actually exists and is unambiguous within our current table context.
The Key Insight: Notice how the visitor maintains state (`_currentTableSchemas`) as it traverses the tree. This context allows later validation phases to reference what was learned in earlier phases.
This is where our Visitor pattern truly shines. Each expression type has its own validation rules, and the visitor applies them recursively as it traverses the expression tree.
Column references present two classic database validation scenarios:
Qualified References (`users.name`)
When a column reference specifies a table name, the validator ensures:

- The table name (or alias) actually appears in the FROM clause
- The column exists in that table’s schema

Unqualified References (`name`)
This is where it gets tricky. The validator must:

- Search every table in the current FROM context for the column
- Report an error if no table has it
- Report an ambiguity error if more than one table has it
The ambiguity check is crucial - `SELECT name FROM users, products` fails because both tables might have a `name` column. SQL requires you to be explicit: `users.name` or `products.name`.
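A minimal sketch of this resolution logic (method and property names are assumptions):

```csharp
private void ValidateColumnReference(ColumnReference col) {
    if (col.TableName != null) {
        // Qualified: the table (or alias) must be in FROM, and the column in it.
        if (!_currentTableSchemas.TryGetValue(col.TableName, out var schema))
            _errors.Add($"Unknown table or alias '{col.TableName}'.");
        else if (!schema.Columns.Any(c =>
                     c.Name.Equals(col.ColumnName, StringComparison.OrdinalIgnoreCase)))
            _errors.Add($"Column '{col.ColumnName}' not found in '{col.TableName}'.");
        return;
    }

    // Unqualified: search every table in scope and demand exactly one match.
    var matches = _currentTableSchemas.Values
        .Where(s => s.Columns.Any(c =>
            c.Name.Equals(col.ColumnName, StringComparison.OrdinalIgnoreCase)))
        .ToList();

    if (matches.Count == 0)
        _errors.Add($"Unknown column '{col.ColumnName}'.");
    else if (matches.Count > 1)
        _errors.Add($"Ambiguous column '{col.ColumnName}'; qualify it with a table name.");
}
```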
Binary expressions require a two-stage validation approach:
Stage 1: Recursive Validation
The visitor first validates both operands by calling `Accept()` on the left and right expressions. This ensures the operands themselves are valid before checking compatibility.
Stage 2: Type Compatibility Logic
Here’s where our type system rules come into play:
Equality Operations (`=`, `!=`)
Both operands must have compatible types - comparing an Int with a String is rejected.

Comparison Operations (`<`, `<=`, `>`, `>=`)
Operands must be orderable; booleans are rejected (what would `true > false` mean?).

Logical Operations (`AND`, `OR`)
Both operands must be boolean expressions - there is no implicit boolean conversion.

Arithmetic Operations (`+`, `-`, `*`, `/`)
Both operands must be numeric; `age + "hello"` is exactly the kind of error caught here.
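Here’s a hedged sketch of both stages (operator spellings and helper methods are assumptions):

```csharp
public void Visit(BinaryExpression expr) {
    // Stage 1: validate each operand recursively before checking compatibility.
    expr.Left.Accept(this);
    expr.Right.Accept(this);

    var left  = _typeInferrer.Infer(expr.Left, _currentTableSchemas);
    var right = _typeInferrer.Infer(expr.Right, _currentTableSchemas);

    switch (expr.Operator) {
        case "=": case "!=":
            // Equality: operands must be comparable (numerics mix freely).
            if (left.Base != right.Base && !(IsNumeric(left) && IsNumeric(right)))
                _errors.Add("Incompatible types in equality comparison.");
            break;
        case "<": case "<=": case ">": case ">=":
            // Ordering: booleans are rejected.
            if (IsBoolean(left) || IsBoolean(right))
                _errors.Add("Boolean values cannot be ordered.");
            break;
        case "AND": case "OR":
            // Logical: both sides must already be boolean.
            if (!IsBoolean(left) || !IsBoolean(right))
                _errors.Add($"'{expr.Operator}' requires boolean operands.");
            break;
        case "+": case "-": case "*": case "/":
            // Arithmetic: both sides must be numeric.
            if (!IsNumeric(left) || !IsNumeric(right))
                _errors.Add($"'{expr.Operator}' requires numeric operands.");
            break;
    }
}

private static bool IsBoolean(ExpressionType t) =>
    t.Kind == TypeKind.Scalar && t.Base == BaseType.Bool;

private static bool IsNumeric(ExpressionType t) =>
    t.Kind == TypeKind.Scalar && (t.Base == BaseType.Int || t.Base == BaseType.Float);
```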
Unary expressions, while simple, have important validation rules:
NOT Operator Validation

- The operand must be a boolean expression
- `NOT name` (String) or `NOT age` (Int) are both illegal
- Boolean expressions like `NOT (age > 18)` are valid

Numeric Negation Operator (`-`)

- The operand must be numeric
- Boolean values cannot be numerically negated

BETWEEN expressions require coordination of three operand types:
Ternary Type Consistency
For `column BETWEEN value1 AND value2`, all three operands must have mutually comparable types.

Boundary Value Logic Check
While semantic analysis doesn’t validate `value1 <= value2`, it ensures the type system supports such comparison operations.
IN expressions handle set membership checks:
Value List Type Consistency
For `column IN (value1, value2, value3)`, every value must be comparable with the column; `age IN (18, 25.5, 30)` is legal (Int and Float mixed).

Subquery IN Operations
For `column IN (SELECT ...)`, the subquery must return a single column whose type is compatible with the left-hand side.
Literal expressions are the simplest but most important foundation:
Type Determinism
Each literal has a clear type mapping: integer literals map to Int, floating-point literals to Float, quoted strings to String, and TRUE/FALSE to Bool.
NULL Value Special Handling
NULL literal type inference depends on usage context, providing important information for the type inferrer.
Subquery expressions are the most complex because they create nested scopes:
Context Isolation
Each subquery is validated in its own scope: column references inside it resolve only against the tables named in the subquery’s own FROM clause.

Usage Context Constraints
Different constraints are imposed based on subquery usage location:

- Scalar contexts (`WHERE age > (SELECT ...)`): Must return single row, single column
- Row contexts (`WHERE (a,b) IN (SELECT ...)`): Can return multiple rows, but column count must match
- Table contexts (`FROM (SELECT ...)`): Can return arbitrary row-column structure

Star expressions have special usage rules:
Context Restrictions

- Legal in SELECT lists (`SELECT *`)
- Legal as the sole argument of COUNT (`COUNT(*)`)
- Illegal in predicates (`WHERE *` is illegal)

Table Qualification Check

- `table.*` requires verifying the table name exists in the FROM clause
- A bare `*` is always legal when there’s a table context
is always legal when there’s a table contextFunction calls present the most complex validation scenario because they involve multiple interdependent constraints that must all be satisfied.
Layer 1: Existence and Semantic Rules
The visitor first ensures the function exists in our catalog and applies special semantic rules. The most important rule: only `COUNT(*)` is allowed to use the star argument - expressions like `SUM(*)` or `AVG(*)` are rejected because they don’t make mathematical sense.
Layer 2: Recursive Argument Validation
Each function argument is recursively validated by calling `Accept()` on it. This ensures that complex expressions like `UPPER(users.name || products.description)` have all their sub-expressions properly validated before we check the function-specific rules.
Layer 3: Arity Checking (Argument Count)
Functions have both required and optional parameters. The validator ensures:

- The argument count is at least the number of required parameters and at most the total parameter count
- The one exception is `COUNT(*)`, which bypasses normal parameter counting

Layer 4: Type Compatibility Matrix
This is the most sophisticated part. For each argument, the validator infers its result type with the `ExpressionTypeInferrer` and checks it against the parameter’s accepted types.

The Subquery Challenge: When a function argument is a subquery (like `WHERE age > (SELECT AVG(age) FROM users)`), the validator must ensure it returns a single column whose scalar type the parameter accepts.
Parameter Type Matching: Each function parameter has an “accepted types” list. For example:

- `UPPER()` only accepts String types
- `ABS()` only accepts numeric types (Int, Float)
- `COALESCE()` can accept any type, but all arguments must be the same type

This layered approach ensures that by the time we finish validating a function call, we know with certainty that it will execute successfully at runtime.
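Here’s a hedged sketch of all four layers in one visitor method (properties like `HasStarArgument` and the `Accepts` helper are assumptions):

```csharp
public void Visit(FunctionCall call) {
    // Layer 1: the function must exist, and only COUNT may take '*'.
    if (!_catalog.FunctionExists(call.Name)) {
        _errors.Add($"Unknown function '{call.Name}'.");
        return;
    }
    if (call.HasStarArgument) {
        if (!call.Name.Equals("COUNT", StringComparison.OrdinalIgnoreCase))
            _errors.Add($"'{call.Name}(*)' is not allowed.");
        return;  // COUNT(*) bypasses the remaining layers
    }

    // Layer 2: validate every argument recursively.
    foreach (var arg in call.Arguments)
        arg.Accept(this);

    // Layer 3: arity - between required and required + optional parameters.
    var def = _catalog.GetFunction(call.Name);
    int required = def.Parameters.Count(p => !p.IsOptional);
    if (call.Arguments.Count < required || call.Arguments.Count > def.Parameters.Count) {
        _errors.Add($"'{call.Name}' expects between {required} and " +
                    $"{def.Parameters.Count} arguments.");
        return;
    }

    // Layer 4: every argument's inferred type must be accepted by its parameter.
    for (int i = 0; i < call.Arguments.Count; i++) {
        var argType = _typeInferrer.Infer(call.Arguments[i], _currentTableSchemas);
        if (!Accepts(def.Parameters[i], argType))  // assumed helper over AcceptedTypes
            _errors.Add($"Argument {i + 1} of '{call.Name}' has an incompatible type.");
    }
}
```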
Now that we can validate semantics, we need a way to automatically figure out what type each expression will produce. This is crucial for query planning and execution.
In our refactored architecture, I’ve separated the type inference logic into its own dedicated class: `ExpressionTypeInferrer`. This gives us better separation of concerns - the `SemanticAnalyzer` focuses on validation while the `ExpressionTypeInferrer` specializes in determining result types.
Our `ExpressionTypeInferrer` acts like a detective, examining expressions and deducing their types:
```csharp
public class ExpressionTypeInferrer {
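    // Sketch: dispatch on the concrete AST node type (names are assumptions).
    private readonly ICatalogManager _catalog;

    public ExpressionTypeInferrer(ICatalogManager catalog) => _catalog = catalog;

    public ExpressionType Infer(Expression expr,
                                Dictionary<string, TableSchema> tablesInScope) =>
        expr switch {
            LiteralExpression lit => InferLiteral(lit),
            ColumnReference col   => InferColumn(col, tablesInScope),
            BinaryExpression bin  => InferBinary(bin, tablesInScope),
            FunctionCall call     => InferFunction(call, tablesInScope),
            SubqueryExpression sq => InferSubquery(sq),
            _                     => ExpressionType.Unknown()
        };
}
```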
The inferrer uses a recursive pattern-matching approach to determine result types. Here’s how it tackles each expression category:
Literals are the simplest case - the inferrer directly maps C# runtime types to our database types:
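A minimal sketch of that mapping, assuming a `LiteralExpression` node whose `Value` holds the raw C# object:

```csharp
private ExpressionType InferLiteral(LiteralExpression lit) => lit.Value switch {
    int    => ExpressionType.Scalar(BaseType.Int, nullable: false),
    long   => ExpressionType.Scalar(BaseType.Int, nullable: false),
    double => ExpressionType.Scalar(BaseType.Float, nullable: false),
    string => ExpressionType.Scalar(BaseType.String, nullable: false),
    bool   => ExpressionType.Scalar(BaseType.Bool, nullable: false),
    null   => ExpressionType.Unknown(),  // NULL: type resolved from context later
    _      => ExpressionType.Unknown()
};
```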
Key insight: NULL values have “unknown” type initially, but get resolved based on context later.
For column references, the inferrer performs a schema lookup:
- Qualified references (`users.name`): Direct table + column lookup in our schema dictionary
- Unqualified references (`name`): Search through all available table schemas (the semantic analyzer already ensured uniqueness)

Nullability determination: A column is nullable unless it has an explicit NOT NULL constraint. This affects all downstream type calculations.
This is where our type system’s logic really shines. The inferrer applies different rules based on operator category:
Comparison Operators (`=`, `<`, `>=`, etc.)
These always produce boolean results regardless of operand types.

Logical Operators (`AND`, `OR`)
These also produce boolean results; the semantic analyzer has already guaranteed both operands are boolean.

Arithmetic Operators (`+`, `-`, `*`, `/`)
Int combined with Int stays Int; if either operand is Float, the result is Float. The result is nullable if either operand is nullable.
Function type inference involves several sophisticated rules:
COUNT(*) Exception: Always returns non-nullable Int, regardless of input nullability (empty tables return 0, not NULL).
Aggregate Functions: Generally return nullable results if any input columns are nullable (empty result sets produce NULL).
MIN/MAX Special Case: These functions return the exact same type as their input argument, preserving both base type and nullability.
Type-Preserving Functions: Functions like `UPPER()`, `LOWER()` return the same base type as input but may affect nullability.
Parameter Type Resolution: For functions with subquery arguments, the inferrer extracts the scalar type from single-column result sets.
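Here’s a hedged sketch combining these rules (matching function names by string here is purely for illustration, and helper details are assumptions):

```csharp
private ExpressionType InferFunction(FunctionCall call,
                                     Dictionary<string, TableSchema> scope) {
    var name = call.Name.ToUpperInvariant();

    // COUNT is special: non-nullable Int even for empty input (0, never NULL).
    if (name == "COUNT")
        return ExpressionType.Scalar(BaseType.Int, nullable: false);

    var argType = Infer(call.Arguments[0], scope);

    // MIN/MAX preserve the argument's exact type, nullability included.
    if (name == "MIN" || name == "MAX")
        return argType;

    // Other aggregates may see an empty result set, so they are nullable.
    if (name == "SUM" || name == "AVG")
        return ExpressionType.Scalar(argType.Base ?? BaseType.Float, nullable: true);

    // Type-preserving scalar functions (UPPER, LOWER, ...) keep the base type.
    return ExpressionType.Scalar(argType.Base ?? BaseType.String, argType.IsNullable);
}
```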
Subqueries present the most sophisticated type inference challenge because they return row sets (multiple columns) rather than scalar values. Here’s how our ExpressionTypeInferrer
handles this complexity:
Context Isolation Strategy
The inferrer creates an isolated context for each subquery: column references inside it resolve only against the tables named in the subquery’s own FROM clause.
SELECT List Analysis
For each item in the subquery’s SELECT list, the inferrer determines its column type:

- Star expressions (`*`): Expand to all columns from all tables in the subquery’s scope
- Qualified stars (`table.*`): Expand to all columns from the specified table only
- Ordinary expressions: Inferred recursively like any other expression

Row Schema Construction
The final result is a `RowSet` type containing a schema (a list of column types) that represents what the subquery will produce. This schema becomes critical when the subquery is used in different contexts:

- In scalar contexts (`WHERE age > (SELECT AVG(age) ...)`) it must be a single-column, single-row result
- In row contexts (`WHERE (name, age) IN (SELECT ...)`) it can be multi-column

The Nested Scope Challenge: The trickiest part is handling column references that might exist in both outer and inner query scopes. Our approach ensures that column resolution within the subquery only sees tables from the subquery’s own FROM clause, maintaining proper SQL scoping rules.
With semantic analysis complete, our database engine now has a solid understanding of what queries mean and whether they’re valid. Our refactored modular architecture with three specialized components (`SemanticAnalyzer`, `ExpressionType`, and `ExpressionTypeInferrer`) provides a clean, maintainable foundation that will serve us well as the system grows.
In the next article, we’ll tackle the Query Planner - the component that decides HOW to execute these validated queries efficiently. The query planner will take these validated ASTs and turn them into concrete execution plans. See you there!