Correlated Subqueries
Use correlated subqueries that reference outer query values.
Understanding the Difference
Let's start with sample data:
id | name | department | salary |
|---|---|---|---|
| 1 | Alice | Engineering | 85000 |
| 2 | Bob | Engineering | 95000 |
| 3 | Charlie | Sales | 70000 |
| 4 | Diana | Sales | 65000 |
| 5 | Eve | Marketing | 75000 |
| 6 | Frank | Marketing | 80000 |
Regular subquery (runs once):
-- Find employees earning more than company average
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);The subquery (SELECT AVG(salary) FROM employees) runs once, calculates 78,333.33, then the outer query compares each salary to that fixed value.
Correlated subquery (runs for each row):
-- Find employees earning more than their department's average
SELECT name, department, salary
FROM employees e1
WHERE salary > (
SELECT AVG(salary)
FROM employees e2
WHERE e2.department = e1.department
);Notice e2.department = e1.department - the subquery references the outer query's e1.department. For each employee:
- Alice (Engineering): Subquery calculates AVG for Engineering (90,000). 85,000 < 90,000 → NOT included
- Bob (Engineering): AVG for Engineering (90,000). 95,000 > 90,000 → INCLUDED
- Charlie (Sales): AVG for Sales (67,500). 70,000 > 67,500 → INCLUDED
- Diana (Sales): AVG for Sales (67,500). 65,000 < 67,500 → NOT included
- Eve (Marketing): AVG for Marketing (77,500). 75,000 < 77,500 → NOT included
- Frank (Marketing): AVG for Marketing (77,500). 80,000 > 77,500 → INCLUDED
The subquery runs 6 times, once per employee, each time with a different department.
Key indicator: If you see the subquery using a column from the outer query (like e1.department), it's correlated!
Performance Considerations
Correlated subqueries run once per outer row, which can be slow on large datasets.
Performance impact: If the outer query returns 10,000 rows, the correlated subquery runs 10,000 times! For large tables, this can be very slow.
Optimization strategies:
1. Use JOINs instead (when possible):
-- Correlated subquery (slower)
SELECT name, salary
FROM employees e1
WHERE salary > (
SELECT AVG(salary)
FROM employees e2
WHERE e2.department = e1.department
);
-- JOIN with GROUP BY (faster)
SELECT e.name, e.salary
FROM employees e
JOIN (
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
) dept_avg ON e.department = dept_avg.department
WHERE e.salary > dept_avg.avg_salary;The JOIN version calculates each department's average once, then joins. Much faster!
2. Use window functions (covered later):
-- Even better: window functions
SELECT name, salary
FROM (
SELECT
name,
salary,
department,
AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees
) subquery
WHERE salary > dept_avg;Window functions calculate the average in a single pass - the most efficient approach!
3. Add indexes on correlated columns:
If you must use correlated subqueries, index the columns used in the correlation:
-- Index the department column for faster correlation
CREATE INDEX idx_employees_department ON employees(department);Best practice: Use correlated subqueries for readability when working with small datasets. For large datasets or production systems, prefer JOINs or window functions.
Common Mistakes to Avoid
Don't make these mistakes:
-
Forgetting table aliases: ❌
- Wrong:
WHERE department = department - This compares the column to itself, not outer to inner
- Fix: Use aliases:
WHERE e2.department = e1.department
- Wrong:
-
Using aggregate without GROUP BY in subquery: Usually OK
- Correlated subqueries with aggregates don't need GROUP BY
- The correlation (WHERE clause) effectively groups the data
-
Confusing EXISTS with IN: Different uses
EXISTSchecks if any rows match (boolean)INchecks if a value is in a set- EXISTS is often faster for correlated checks
-
Over-using correlated subqueries: ❌
- They're powerful but can be slow
- Check if a JOIN or window function works instead
-
Not testing on large datasets: ❌
- Correlated subquery works fine on 100 rows
- Becomes unusably slow on 1,000,000 rows
- Always test performance!
-
Expecting subquery to return multiple columns: ❌
- Wrong:
WHERE (col1, col2) = (SELECT col1, col2 FROM ...) - In most contexts, subquery must return a single value
- Use EXISTS or IN for multi-column correlations
- Wrong:
-
Missing the correlation condition: ❌
- Forgetting
WHERE e2.dept = e1.deptmakes it uncorrelated - Results will be wrong!
- Forgetting
Practical Examples
Example 1: Find products that have never been ordered
SELECT product_id, product_name
FROM products p
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.product_id = p.product_id
);Example 2: Find employees hired before anyone in their department
id | name | department | hire_date |
|---|---|---|---|
| 1 | Alice | Engineering | 2020-01-15 |
| 2 | Bob | Engineering | 2021-03-20 |
| 3 | Charlie | Sales | 2019-06-10 |
| 4 | Diana | Sales | 2022-02-01 |
| 5 | Eve | Marketing | 2020-11-05 |
-- Find the most senior employee in each department
SELECT name, department, hire_date
FROM employees_with_dates e1
WHERE NOT EXISTS (
SELECT 1
FROM employees_with_dates e2
WHERE e2.department = e1.department
AND e2.hire_date < e1.hire_date
);For each employee, we check: "Is there anyone in my department hired earlier?" If no, they're the most senior!
Example 3: Customers with above-average order frequency
-- Customers who order more frequently than average
SELECT
c.customer_id,
c.name,
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count
FROM customers c
WHERE (
SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id
) > (
SELECT AVG(order_count)
FROM (
SELECT COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
) AS counts
);Alice has 2 orders, Bob and Charlie have 1 each. Average = (2+1+1)/3 = 1.33. Alice is above average!
Key Takeaways
What you learned:
✅ Correlated subqueries reference columns from the outer query ✅ They execute once per row of the outer query ✅ EXISTS and NOT EXISTS are common with correlated subqueries ✅ Use table aliases to distinguish outer and inner queries ✅ Correlated subqueries enable row-by-row comparisons (e.g., comparing to group averages) ✅ They're powerful but can be slow on large datasets ✅ JOINs and window functions are often faster alternatives ✅ Common patterns: above group average, top N per group, finding duplicates, checking existence
When to use correlated subqueries:
- Comparing each row to its group's statistics
- Checking for existence (EXISTS/NOT EXISTS)
- Finding top N per category
- Detecting duplicates
- When readability matters more than performance (small datasets)
When to avoid:
- Large datasets (millions of rows)
- When a JOIN or window function can do the job
- Production queries needing optimal performance
Coming up next: You'll learn about set operations (UNION, INTERSECT, EXCEPT) to combine and compare entire result sets!
Practice Exercise: Try these queries:
- Find products more expensive than their category average
- List customers who have placed more orders than the average customer
- Find employees hired before anyone else in their department
- Identify customers who have ordered every product
- Find products that have been ordered but never returned