How Jeffrey Ullman’s core ideas power modern databases: relational algebra, optimization rules, joins, and compiler-style planning that helps systems scale.

Most people who write SQL, build dashboards, or tune a slow query have benefited from Jeffrey Ullman’s work—even if they’ve never heard his name. Ullman is a computer scientist and educator whose research and textbooks helped define how databases describe data, reason about queries, and execute them efficiently.
When a database engine turns your SQL into something it can run fast, it’s relying on ideas that must be both precise and adaptable. Ullman helped formalize the meaning of queries (so the system can rewrite them safely), and he helped connect database thinking with compiler thinking (so a query can be parsed, optimized, and translated into executable steps).
That influence is quiet because it doesn’t show up as a button in your BI tool or a visible feature in your cloud console. It shows up in the machinery underneath: how a JOIN gets planned, how filters get rewritten, and which execution strategy the engine chooses.

This post uses Ullman’s core ideas as a guided tour of the database internals that matter most in practice: how relational algebra sits underneath SQL, how query rewrites preserve meaning, why cost-based optimizers make the choices they do, and how join algorithms often decide whether a job finishes in seconds or hours.
We’ll also pull in a few compiler-like concepts—parsing, rewriting, and planning—because database engines behave more like sophisticated compilers than many people realize.
A quick promise: we’ll keep the discussion accurate, but avoid math-heavy proofs. The goal is to give you mental models you can apply at work the next time performance, scaling, or confusing query behavior shows up.
If you’ve ever written a SQL query and expected it to “just mean one thing,” you’re relying on ideas that Jeffrey Ullman helped popularize and formalize: a clean model for data, plus precise ways to describe what a query asks for.
At its core, the relational model treats data as tables (relations). Each table has rows (tuples) and columns (attributes). That sounds obvious now, but the important part is the discipline it creates: clear keys to identify rows, explicit relationships between tables, and a precise meaning for every column.
This framing makes it possible to reason about correctness and performance without hand-waving. When you know what a table represents and how rows are identified, you can predict what joins should do, what duplicates mean, and why certain filters change results.
Ullman’s teaching often uses relational algebra as a kind of query calculator: a small set of operations (select, project, join, union, difference) that you can combine to express what you want.
Why it matters to working with SQL: databases translate SQL into an algebraic form and then rewrite it into another equivalent form. Two queries that look different can be algebraically the same—which is how optimizers can reorder joins, push down filters, or remove redundant work while keeping the meaning intact.
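For example, these two queries (written against the customers and orders tables used in the examples later in this post) look different but are algebraically the same, so an optimizer is free to plan them identically:

-- Version 1: filter inside a derived table, then join.
SELECT c.name
FROM customers c
JOIN (SELECT customer_id FROM orders WHERE total > 100) o
  ON o.customer_id = c.id;

-- Version 2: join first, filter in the outer WHERE clause.
SELECT c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 100;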
SQL is largely “what,” but engines often optimize using algebraic “how.”
SQL dialects vary (Postgres vs. Snowflake vs. MySQL), but the fundamentals don’t. Understanding keys, relationships, and algebraic equivalence helps you spot when a query is logically wrong, when it’s merely slow, and which changes preserve meaning across platforms.
Relational algebra is the “math underneath” SQL: a small set of operators that describe the result you want. Jeffrey Ullman’s work helped make this operator view crisp and teachable—and it’s still the mental model most optimizers use.
A database query can be expressed as a pipeline of a few building blocks:
selection (the WHERE idea)
projection (the SELECT col1, col2 idea)
join (JOIN ... ON ...)
union (UNION)
difference (EXCEPT in many SQL dialects)

Because the set is small, it becomes easier to reason about correctness: if two algebra expressions are equivalent, they return the same table for any valid database state.
Take a familiar query:
SELECT c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 100;
Conceptually, this is:
start with a join of customers and orders: customers ⋈ orders
select only orders above 100: σ(o.total > 100)(...)
project the one column you want: π(c.name)(...)
That’s not the exact internal notation used by every engine, but it’s the right idea: SQL becomes an operator tree.
Many different trees can mean the same result. For example, filters can often be pushed earlier (apply σ before a big join), and projections can often drop unused columns sooner (apply π earlier).
Those equivalence rules are what let a database rewrite your query into a cheaper plan without changing the meaning. Once you see queries as algebra, “optimization” stops being magic and becomes safe, rule-guided reshaping.
When you write SQL, the database doesn’t execute it “as written.” It translates your statement into a query plan: a structured representation of the work to be done.
A good mental model is a tree of operators. Leaves read tables or indexes; internal nodes transform and combine rows. Common operators include scan, filter (selection), project (choose columns), join, group/aggregate, and sort.
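To see that operator tree for your own queries, most engines can show it on request. A minimal sketch, assuming PostgreSQL syntax (other engines have their own EXPLAIN variants):

-- Ask the planner for the operator tree it would use, without executing the query.
EXPLAIN
SELECT c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 100;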
Databases typically separate planning into two layers: a logical plan that describes what operations are needed (filter, join, aggregate), and a physical plan that decides how to run them (index scan vs. full scan, hash join vs. nested loop, sort strategies).
Ullman’s influence shows up in the emphasis on meaning-preserving transformations: rearrange the logical plan in many ways without changing the answer, then pick an efficient physical strategy.
Before choosing the final execution approach, optimizers apply algebraic “cleanup” rules. These rewrites don’t change results; they reduce unnecessary work.
Common examples include pushing filters closer to the tables they apply to, dropping columns that aren’t needed later, and removing redundant work that can’t affect the final result.
Suppose you want orders for users in one country:
SELECT o.order_id, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.country = 'CA';
A naïve interpretation might join all users to all orders and then filter to Canada. A meaning-preserving rewrite pushes the filter down so the join touches fewer rows:
filter users to country = 'CA' before the join
project only order_id and total at the end

In plan terms, the optimizer tries to turn:
Join(Users, Orders) → Filter(country='CA') → Project(order_id,total)
into something closer to:
Filter(country='CA') on Users → Join(with Orders) → Project(order_id,total)
Same answer. Less work.
These rewrites are easy to overlook because you never type them—yet they’re a major reason the same SQL can run fast on one database and slow on another.
When you run a SQL query, the database considers multiple valid ways to get the same answer, then chooses the one it expects to be cheapest. That decision process is called cost-based optimization—and it’s one of the most practical places where Ullman-style theory shows up in everyday performance.
A cost model is a scoring system the optimizer uses to compare alternative plans. Most engines estimate cost using a few core resources: how many rows are processed, how much I/O is required, how much CPU the operators consume, and how much memory is available (including whether a hash or sort will spill to disk).
The model doesn’t need to be perfect; it needs to be directionally right often enough to pick good plans.
Before it can score plans, the optimizer asks a question at every step: how many rows will this produce? That’s cardinality estimation.
If you filter WHERE country = 'CA', the engine estimates what fraction of the table matches. If you join customers to orders, it estimates how many pairs will match on the join key. These row-count guesses determine whether it prefers an index scan over a full scan, a hash join over a nested loop, or whether a sort will be small or enormous.
The optimizer’s guesses are driven by statistics: counts, value distributions, null rates, and sometimes correlations between columns.
When stats are stale or missing, the engine can misjudge row counts by orders of magnitude. A plan that looks cheap on paper can become expensive in reality—classic symptoms include sudden slowdowns after data growth, “random” plan changes, or joins that unexpectedly spill to disk.
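If stale statistics are a suspect, most engines let you refresh them directly. A minimal sketch, assuming PostgreSQL (MySQL’s ANALYZE TABLE and SQL Server’s UPDATE STATISTICS play a similar role):

-- Recompute planner statistics for one table so row-count estimates reflect current data.
ANALYZE orders;

-- Or refresh statistics across the whole database.
ANALYZE;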
Better estimates often require more work: more detailed stats, sampling, or exploring more candidate plans. But planning itself costs time, especially for complex queries.
So optimizers balance two goals: finding a plan that’s good enough, and not spending more time planning than that plan would save at execution.
Understanding that trade-off helps you interpret EXPLAIN output: the optimizer isn’t trying to be clever—it’s trying to be predictably correct under limited information.
Ullman’s work helped popularize a simple but powerful idea: SQL isn’t “run” so much as translated into an execution plan. Nowhere is that more obvious than joins. Two queries that return the same rows can differ wildly in runtime depending on which join algorithm the engine chooses—and in what order it joins tables.
Nested loop join is conceptually straightforward: for each row on the left, find matching rows on the right. It can be fast when the left side is small and the right side has a useful index.
Hash join builds a hash table from one input (often the smaller) and probes it with the other. It shines for large, unsorted inputs with equality conditions (e.g., A.id = B.id), but needs memory; spill-to-disk can erase the advantage.
Merge join walks two inputs in sorted order. It’s a great fit when both sides are already ordered (or cheaply sortable), such as when indexes can deliver rows in join-key order.
With three or more tables, the number of possible join orders explodes. Joining two large tables first might create a huge intermediate result that slows everything else. A better order often starts with the most selective filter (fewest rows) and joins outward, keeping intermediates small.
Indexes don’t just speed up lookups—they make certain join strategies viable. An index on the join key can turn an expensive nested loop into a quick “seek per row” pattern. Conversely, missing or unusable indexes may push the engine toward hash joins or large sorts for merge joins.
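A small sketch of the cheapest experiment, assuming the users/orders example from earlier and PostgreSQL syntax (the index name is arbitrary): add an index on the join key, then re-check which join strategy the planner picks.

-- An index on the join key can make an index-driven nested loop viable.
CREATE INDEX idx_orders_user_id ON orders (user_id);

-- Re-check the chosen join strategy and join order.
EXPLAIN
SELECT o.order_id, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.country = 'CA';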
Databases don’t just “run SQL.” They compile it. Ullman’s influence spans both database theory and compiler thinking, and that connection explains why query engines behave like programming language toolchains: they translate, rewrite, and optimize before doing any work.
When you send a query, the first step looks like a compiler’s front end. The engine tokenizes keywords and identifiers, checks grammar, and builds a parse tree (often simplified into an abstract syntax tree). This is where basic errors are caught: missing commas, ambiguous column names, invalid grouping rules.
A helpful mental model: SQL is a programming language whose “program” happens to describe data relationships instead of loops.
Compilers convert syntax into an intermediate representation (IR). Databases do something similar: they translate SQL syntax into logical operators such as:
scan (read a table)
filter (selection)
project (choose columns)
join
aggregate (GROUP BY)

That logical form is closer to relational algebra than SQL text, which makes it easier to reason about meaning and equivalence.
Compiler optimizations keep program results identical while making execution cheaper. Database optimizers do the same, using rule systems that push filters and projections earlier, reorder joins, and eliminate operators whose output is never used.
This is the database version of “dead code elimination”: not identical techniques, but the same philosophy—preserve semantics, reduce cost.
If your query is slow, don’t stare at SQL alone. Look at the query plan the way you’d inspect compiler output. A plan tells you what the engine actually chose: join order, index usage, and where time is spent.
Practical takeaway: learn to read EXPLAIN output as a performance “assembly listing.” It turns tuning from guesswork into evidence-based debugging. For more on turning that into a habit, see /blog/practical-query-optimization-habits.
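A minimal version of that habit, assuming PostgreSQL (most engines have an equivalent): run the statement with actual execution statistics and find the step where estimated and actual row counts diverge.

-- Execute the query and report actual row counts, timings, and buffer usage alongside the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.total > 100;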
Good query performance often starts before you write SQL. Ullman’s schema design theory (especially normalization) is about structuring data so the database can keep it correct, predictable, and efficient as it grows.
Normalization aims to store each fact once, avoid update anomalies, and keep data consistent as it grows and changes.
Those correctness wins translate into performance wins later: fewer duplicated fields, smaller indexes, and fewer expensive updates.
You don’t need to memorize proofs to use the ideas: clear primary keys, one fact stored in one place, and no multi-valued columns cover most day-to-day modeling decisions.
Denormalization can be a smart choice when read-heavy or analytical workloads repeat the same joins and the cost of keeping duplicated data in sync is worth the faster reads.
The key is to denormalize deliberately, with a process to keep duplicates in sync.
Schema design shapes what the optimizer can do. Clear keys and foreign keys enable better join strategies, safer rewrites, and more accurate row-count estimates. Meanwhile, excessive duplication can bloat indexes and slow writes, and multi-valued columns block efficient predicates. As data volume grows, these early modeling decisions often matter more than micro-optimizing a single query.
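A minimal sketch of what clear keys and relationships look like in DDL, reusing the column names from earlier examples (the data types are assumptions):

-- Explicit primary and foreign keys give the optimizer reliable facts about uniqueness and relationships.
CREATE TABLE users (
  id      BIGINT PRIMARY KEY,
  country TEXT NOT NULL
);

CREATE TABLE orders (
  order_id BIGINT PRIMARY KEY,
  user_id  BIGINT NOT NULL REFERENCES users (id),
  total    NUMERIC NOT NULL
);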
When a system “scales,” it’s rarely just about adding bigger machines. More often, the hard part is that the same query meaning must be preserved while the engine chooses a very different physical strategy to keep runtimes predictable. Ullman’s emphasis on formal equivalences is exactly what allows those strategy changes without changing results.
At small sizes, many plans “work.” At scale, the difference between scanning a table, using an index, or using a precomputed result can be the difference between seconds and hours. The theory side matters because the optimizer needs a safe set of rewrite rules (e.g., pushing filters earlier, reordering joins) that don’t alter the answer—even if they radically alter the work performed.
Partitioning (by date, customer, region, etc.) turns one logical table into many physical pieces. That affects planning: filters on the partition key let the engine skip entire partitions, and the best join or aggregation strategy can change depending on which pieces actually need to be read.
The SQL text may be unchanged, but the best plan now depends on where the rows live.
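A sketch of declarative range partitioning, assuming PostgreSQL syntax and an order_date column that the earlier examples didn’t show:

-- One logical table, stored as physical pieces split by date range.
CREATE TABLE orders_partitioned (
  order_id   BIGINT NOT NULL,
  user_id    BIGINT NOT NULL,
  total      NUMERIC NOT NULL,
  order_date DATE NOT NULL
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024 PARTITION OF orders_partitioned
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- A filter on the partition key lets the planner skip partitions that cannot match.
SELECT count(*) FROM orders_partitioned WHERE order_date >= DATE '2024-06-01';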
Materialized views are essentially “saved subexpressions.” If the engine can prove that your query matches (or can be rewritten to match) a stored result, it can replace expensive work—like repeated joins and aggregations—with a fast lookup. This is relational algebra thinking in practice: recognize equivalence, then reuse.
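A small sketch of that reuse, assuming PostgreSQL syntax (the view name and aggregation are illustrative):

-- Precompute a repeated join and aggregation once; reads become a cheap lookup.
CREATE MATERIALIZED VIEW order_totals_by_country AS
SELECT u.country, sum(o.total) AS total_spend
FROM users u
JOIN orders o ON o.user_id = u.id
GROUP BY u.country;

-- Redo the expensive work on a schedule you control, not on every query.
REFRESH MATERIALIZED VIEW order_totals_by_country;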
Caching can speed up repeated reads, but it won’t rescue a query that must scan too much data, shuffle huge intermediate results, or compute a giant join. When scale issues appear, the fix is frequently: reduce the amount of data touched (layout/partitioning), reduce repeated computation (materialized views), or change the plan—not just “add cache.”
Ullman’s influence shows up in a simple mindset: treat a slow query as a statement of intent that the database is free to rewrite, then verify what it actually decided to do. You don’t need to be a theorist to benefit—you just need a repeatable routine.
Start with the parts that usually dominate runtime: full scans of large tables, joins that produce huge intermediate results, and sorts or aggregations big enough to spill to disk.
If you only do one thing, identify the first operator where the row count explodes. That’s usually the root cause.
These are easy to write and surprisingly costly:
Functions wrapped around indexed columns: WHERE LOWER(email) = ... can prevent index usage (use a normalized column or a functional index if supported; a sketch follows below).
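If your engine supports expression (functional) indexes, PostgreSQL does, for example, you can index the transformed value instead. A sketch with an assumed users.email column and an arbitrary index name:

-- Index the expression itself so equality tests on LOWER(email) can use the index.
CREATE INDEX idx_users_email_lower ON users (LOWER(email));

-- This predicate can now be matched against the expression index.
SELECT id FROM users WHERE LOWER(email) = 'someone@example.com';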
Relational algebra encourages two practical moves: push WHERE conditions before joins whenever possible to shrink inputs, and drop columns you don’t need as early as possible so intermediate results stay small.

A good hypothesis sounds like: “This join is expensive because we’re joining too many rows; if we filter orders to the last 30 days first, the join input drops.”
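A sketch of testing that hypothesis by making the filter explicit, assuming an orders.order_date column (not shown in earlier examples) and PostgreSQL date arithmetic:

-- Shrink the join input first; same meaning, smaller intermediate result.
SELECT u.id, o.order_id, o.total
FROM users u
JOIN (
  SELECT order_id, user_id, total
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
) o ON o.user_id = u.id;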
Use a simple decision rule:
Rewrite the query when EXPLAIN shows avoidable work (unnecessary joins, late filtering, non-sargable predicates); otherwise look at statistics, indexes, or physical layout.

The goal isn’t “clever SQL.” It’s predictable, smaller intermediate results—exactly the kind of equivalence-preserving improvement Ullman’s ideas make easier to spot.
These concepts aren’t just for database administrators. If you’re shipping an application, you’re making database and query-planning decisions whether you realize it or not: schema shape, key choices, query patterns, and the data access layer all influence what the optimizer can do.
If you’re using a vibe-coding workflow (for example, generating a React + Go + PostgreSQL app from a chat interface in Koder.ai), Ullman-style mental models are a practical safety net: you can review the generated schema for clean keys and relationships, inspect the queries your app relies on, and validate performance with EXPLAIN before problems show up in production. The faster you can iterate on “query intent → plan → fix,” the more value you get from accelerated development.
You don’t need to “study theory” as a separate hobby. The fastest way to benefit from Ullman-style fundamentals is to learn just enough to read query plans confidently—and then practice on your own database.
Search for these books and lecture topics (no affiliation—just widely cited starting points): A First Course in Database Systems (Ullman and Widom), Database Systems: The Complete Book (Garcia-Molina, Ullman, and Widom), and, for the compiler side, Compilers: Principles, Techniques, and Tools (Aho, Lam, Sethi, and Ullman).
Start small and keep each step tied to something you can observe:
Pick 2–3 real queries and iterate:
Try small, meaning-preserving rewrites: switch IN to EXISTS, push predicates earlier, remove unnecessary columns, and compare results (see the sketch below).
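For example, these two forms ask for the same rows when user_id is never NULL; the useful part is comparing their plans, not memorizing which spelling is “better”:

-- Form 1: membership test with IN.
SELECT u.id
FROM users u
WHERE u.id IN (SELECT o.user_id FROM orders o);

-- Form 2: existence test with a correlated subquery.
SELECT u.id
FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);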
Use clear, plan-based language when you discuss performance with your team. That’s the practical payoff of Ullman’s foundations: you get a shared vocabulary to explain performance—without guessing.
Jeffrey Ullman helped formalize how databases represent query meaning and how they can safely transform queries into faster equivalents. That foundation shows up every time an engine rewrites a query, reorders joins, or picks a different execution plan while guaranteeing the same result set.
Relational algebra is a small set of operators (select, project, join, union, difference) that precisely describe query results. Engines commonly translate SQL into an algebra-like operator tree so they can apply equivalence rules (like pushing filters earlier) before choosing an execution strategy.
Because optimization depends on proving that a rewritten query returns the same results. Equivalence rules let the optimizer do things like:
apply WHERE filters before a join
reorder joins
remove redundant work

These changes can cut work dramatically without changing meaning.
A logical plan describes what operations are needed (filter, join, aggregate) independent of storage details. A physical plan chooses how to run them (index scan vs. full scan, hash join vs. nested loop, parallelism, sort strategies). Most performance differences come from physical choices, enabled by logical rewrites.
Cost-based optimization evaluates multiple valid plans and chooses the one with the lowest estimated cost. Costs are usually driven by practical factors like rows processed, I/O, CPU, and memory (including whether a hash or sort spills to disk).
Cardinality estimation is the optimizer’s guess of “how many rows will come out of this step?” Those estimates determine join order, join type, and whether an index scan is worthwhile. When estimates are wrong (often due to stale/missing statistics), you can get sudden slowdowns, big spills, or surprising plan changes.
Focus on a few high-signal clues: the first operator where row counts explode, the join order and join algorithms chosen, whether indexes were actually used, and any sorts or hashes that spill to disk.
Treat the plan like compiled output: it shows what the engine actually decided to do.
Normalization reduces duplicated facts and update anomalies, which often means smaller tables and indexes and more reliable joins. Denormalization can still be right for analytics or repeated read-heavy patterns, but it should be deliberate (clear refresh rules, known redundancy) so correctness doesn’t degrade over time.
Scale often requires changing physical strategy while keeping query meaning identical. Common tools include partitioning (so queries touch only the relevant pieces), materialized views (so repeated joins and aggregations are precomputed), and indexes or layout choices that match real access patterns.
Caching helps repeated reads, but it won’t fix a query that must touch too much data or produces huge intermediate joins.