Correlated Subqueries

Use correlated subqueries that reference outer query values.

20 min read
Intermediate

Introduction to Correlated Subqueries

You've learned about regular subqueries - they run once and return a result that the outer query uses. But what if you need the subquery to run for each row of the outer query, using values from that row?

That's a correlated subquery - it references columns from the outer query, creating a row-by-row dependency.

Real-world scenario: "Find employees who earn more than the average salary in their department."

You can't use a simple subquery because the average is different for each department. You need a subquery that references the current employee's department - that's correlation!

A subquery that references columns from the outer query. It's executed once for each row processed by the outer query, using values from that specific row.

Regular vs Correlated Subqueries
Feature
Regular Subquery
Correlated Subquery
ExecutionRuns once, independentlyRuns for each outer row
ReferencesNo outer query columnsReferences outer query columns
PerformanceUsually fasterCan be slower (row-by-row)
Use CaseFixed comparisonsRow-dependent comparisons

Understanding the Difference

Let's start with sample data:

employees
id
name
department
salary
1AliceEngineering85000
2BobEngineering95000
3CharlieSales70000
4DianaSales65000
5EveMarketing75000
6FrankMarketing80000

Regular subquery (runs once):

sql
-- Find employees earning more than company average
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

The subquery (SELECT AVG(salary) FROM employees) runs once, calculates 78,333.33, then the outer query compares each salary to that fixed value.

Correlated subquery (runs for each row):

sql
-- Find employees earning more than their department's average
SELECT name, department, salary
FROM employees e1
WHERE salary > (
  SELECT AVG(salary)
  FROM employees e2
  WHERE e2.department = e1.department
);

Notice e2.department = e1.department - the subquery references the outer query's e1.department. For each employee:

  1. Alice (Engineering): Subquery calculates AVG for Engineering (90,000). 85,000 < 90,000 → NOT included
  2. Bob (Engineering): AVG for Engineering (90,000). 95,000 > 90,000 → INCLUDED
  3. Charlie (Sales): AVG for Sales (67,500). 70,000 > 67,500 → INCLUDED
  4. Diana (Sales): AVG for Sales (67,500). 65,000 < 67,500 → NOT included
  5. Eve (Marketing): AVG for Marketing (77,500). 75,000 < 77,500 → NOT included
  6. Frank (Marketing): AVG for Marketing (77,500). 80,000 > 77,500 → INCLUDED

The subquery runs 6 times, once per employee, each time with a different department.

Key indicator: If you see the subquery using a column from the outer query (like e1.department), it's correlated!

Correlated Subquery Syntax

Structure:

sql
SELECT column1, column2
FROM table1 outer_alias
WHERE column > (
  SELECT aggregate_function(column)
  FROM table2 inner_alias
  WHERE inner_alias.column = outer_alias.column
);

Key components:

  1. Outer query alias (outer_alias) - identifies rows from the outer query
  2. Inner query alias (inner_alias) - identifies rows in the subquery
  3. Correlation condition - inner_alias.column = outer_alias.column
  4. The subquery references the outer alias

Important: You must use table aliases (e1, e2, etc.) to distinguish outer and inner references!

EXISTS with Correlated Subqueries

EXISTS checks if a correlated subquery returns any rows. It's extremely common with correlated subqueries.

orders
order_id
customer_id
product_id
quantity
order_date
1101122024-01-15
2102212024-01-20
3101352024-02-01
4103112024-02-05
customers
customer_id
name
email
101Alicealice@example.com
102Bobbob@example.com
103Charliecharlie@example.com
104Dianadiana@example.com

Find customers who have placed orders:

sql
SELECT customer_id, name
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

For each customer, the subquery checks: "Are there any orders with this customer_id?" If yes, include the customer. Diana (104) has no orders, so she's excluded.

Find customers who have NOT placed orders:

sql
SELECT customer_id, name
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

NOT EXISTS is the opposite - it returns rows where the subquery finds no matches.

Pro tip: In EXISTS subqueries, SELECT 1 is conventional. The actual value doesn't matter - EXISTS only cares whether rows are returned, not what data they contain.

Common Patterns with Correlated Subqueries

Pattern 1: Find items above group average

products
product_id
product_name
category
price
1LaptopElectronics999.99
2MouseElectronics29.99
3Desk ChairFurniture199.99
4MonitorElectronics349.99
5Desk LampFurniture39.99
6KeyboardElectronics79.99
sql
-- Products more expensive than their category average
SELECT product_name, category, price
FROM products p1
WHERE price > (
  SELECT AVG(price)
  FROM products p2
  WHERE p2.category = p1.category
);

Pattern 2: Find top N per group

sql
-- Employees who are in the top 2 highest paid in their department
SELECT name, department, salary
FROM employees e1
WHERE (
  SELECT COUNT(*)
  FROM employees e2
  WHERE e2.department = e1.department
    AND e2.salary > e1.salary
) < 2
ORDER BY department, salary DESC;

For each employee, this counts how many employees in the same department earn more. If fewer than 2 earn more, they're in the top 2!

Pattern 3: Find duplicates

orders_raw
order_id
customer_id
order_date
total
11012024-02-01150.00
21022024-02-01200.00
31012024-02-01150.00
41032024-02-02100.00
sql
-- Find duplicate orders (same customer, date, and total)
SELECT *
FROM orders_raw o1
WHERE EXISTS (
  SELECT 1
  FROM orders_raw o2
  WHERE o2.customer_id = o1.customer_id
    AND o2.order_date = o1.order_date
    AND o2.total = o1.total
    AND o2.order_id != o1.order_id
);

Orders 1 and 3 have identical customer, date, and total values - likely duplicates!

o2.order_id != o1.order_id ensures we don't match a row with itself.

Scalar Correlated Subqueries

Correlated subqueries can appear in the SELECT clause, returning a value for each row:

sql
-- Show each employee with their department's average salary
SELECT
  name,
  department,
  salary,
  (SELECT AVG(salary)
   FROM employees e2
   WHERE e2.department = e1.department) AS dept_avg_salary,
  salary - (SELECT AVG(salary)
            FROM employees e2
            WHERE e2.department = e1.department) AS difference_from_avg
FROM employees e1
ORDER BY department, salary DESC;

Each row calculates its department's average on-the-fly. This makes it easy to see who's above or below their department average!

Correlated Subqueries with IN

You can use IN with correlated subqueries for complex filtering:

sql
-- Find customers who ordered product 1
SELECT customer_id, name
FROM customers c
WHERE customer_id IN (
  SELECT customer_id
  FROM orders
  WHERE product_id = 1
);

This isn't technically correlated (the subquery doesn't reference c), but let's make it correlated:

sql
-- Find products that were ordered by customer 101
SELECT product_id, product_name
FROM products p
WHERE product_id IN (
  SELECT product_id
  FROM orders o
  WHERE o.customer_id = 101
);

Performance Considerations

Correlated subqueries run once per outer row, which can be slow on large datasets.

Performance impact: If the outer query returns 10,000 rows, the correlated subquery runs 10,000 times! For large tables, this can be very slow.

Optimization strategies:

1. Use JOINs instead (when possible):

sql
-- Correlated subquery (slower)
SELECT name, salary
FROM employees e1
WHERE salary > (
  SELECT AVG(salary)
  FROM employees e2
  WHERE e2.department = e1.department
);

-- JOIN with GROUP BY (faster)
SELECT e.name, e.salary
FROM employees e
JOIN (
  SELECT department, AVG(salary) AS avg_salary
  FROM employees
  GROUP BY department
) dept_avg ON e.department = dept_avg.department
WHERE e.salary > dept_avg.avg_salary;

The JOIN version calculates each department's average once, then joins. Much faster!

2. Use window functions (covered later):

sql
-- Even better: window functions
SELECT name, salary
FROM (
  SELECT
    name,
    salary,
    department,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg
  FROM employees
) subquery
WHERE salary > dept_avg;

Window functions calculate the average in a single pass - the most efficient approach!

3. Add indexes on correlated columns:

If you must use correlated subqueries, index the columns used in the correlation:

sql
-- Index the department column for faster correlation
CREATE INDEX idx_employees_department ON employees(department);

Best practice: Use correlated subqueries for readability when working with small datasets. For large datasets or production systems, prefer JOINs or window functions.

Common Mistakes to Avoid

Don't make these mistakes:

  1. Forgetting table aliases: ❌

    • Wrong: WHERE department = department
    • This compares the column to itself, not outer to inner
    • Fix: Use aliases: WHERE e2.department = e1.department
  2. Using aggregate without GROUP BY in subquery: Usually OK

    • Correlated subqueries with aggregates don't need GROUP BY
    • The correlation (WHERE clause) effectively groups the data
  3. Confusing EXISTS with IN: Different uses

    • EXISTS checks if any rows match (boolean)
    • IN checks if a value is in a set
    • EXISTS is often faster for correlated checks
  4. Over-using correlated subqueries: ❌

    • They're powerful but can be slow
    • Check if a JOIN or window function works instead
  5. Not testing on large datasets: ❌

    • Correlated subquery works fine on 100 rows
    • Becomes unusably slow on 1,000,000 rows
    • Always test performance!
  6. Expecting subquery to return multiple columns: ❌

    • Wrong: WHERE (col1, col2) = (SELECT col1, col2 FROM ...)
    • In most contexts, subquery must return a single value
    • Use EXISTS or IN for multi-column correlations
  7. Missing the correlation condition: ❌

    • Forgetting WHERE e2.dept = e1.dept makes it uncorrelated
    • Results will be wrong!

Practical Examples

Example 1: Find products that have never been ordered

sql
SELECT product_id, product_name
FROM products p
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.product_id = p.product_id
);

Example 2: Find employees hired before anyone in their department

employees_with_dates
id
name
department
hire_date
1AliceEngineering2020-01-15
2BobEngineering2021-03-20
3CharlieSales2019-06-10
4DianaSales2022-02-01
5EveMarketing2020-11-05
sql
-- Find the most senior employee in each department
SELECT name, department, hire_date
FROM employees_with_dates e1
WHERE NOT EXISTS (
  SELECT 1
  FROM employees_with_dates e2
  WHERE e2.department = e1.department
    AND e2.hire_date < e1.hire_date
);

For each employee, we check: "Is there anyone in my department hired earlier?" If no, they're the most senior!

Example 3: Customers with above-average order frequency

sql
-- Customers who order more frequently than average
SELECT
  c.customer_id,
  c.name,
  (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count
FROM customers c
WHERE (
  SELECT COUNT(*)
  FROM orders o
  WHERE o.customer_id = c.customer_id
) > (
  SELECT AVG(order_count)
  FROM (
    SELECT COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
  ) AS counts
);

Alice has 2 orders, Bob and Charlie have 1 each. Average = (2+1+1)/3 = 1.33. Alice is above average!

Key Takeaways

What you learned:

Correlated subqueries reference columns from the outer query ✅ They execute once per row of the outer query ✅ EXISTS and NOT EXISTS are common with correlated subqueries ✅ Use table aliases to distinguish outer and inner queries ✅ Correlated subqueries enable row-by-row comparisons (e.g., comparing to group averages) ✅ They're powerful but can be slow on large datasetsJOINs and window functions are often faster alternatives ✅ Common patterns: above group average, top N per group, finding duplicates, checking existence

When to use correlated subqueries:

  • Comparing each row to its group's statistics
  • Checking for existence (EXISTS/NOT EXISTS)
  • Finding top N per category
  • Detecting duplicates
  • When readability matters more than performance (small datasets)

When to avoid:

  • Large datasets (millions of rows)
  • When a JOIN or window function can do the job
  • Production queries needing optimal performance

Coming up next: You'll learn about set operations (UNION, INTERSECT, EXCEPT) to combine and compare entire result sets!

Practice Exercise: Try these queries:

  1. Find products more expensive than their category average
  2. List customers who have placed more orders than the average customer
  3. Find employees hired before anyone else in their department
  4. Identify customers who have ordered every product
  5. Find products that have been ordered but never returned