Kekkei - Advancing Education For Everyone

Introduction to Correlated Subqueries

You've learned about regular subqueries - they run once and return a result that the outer query uses. But what if you need the subquery to run for each row of the outer query, using values from that row?

That's a correlated subquery - it references columns from the outer query, creating a row-by-row dependency.

Real-world scenario: "Find employees who earn more than the average salary in their department."

You can't use a simple subquery because the average is different for each department. You need a subquery that references the current employee's department - that's correlation!

A subquery that references columns from the outer query. It's executed once for each row processed by the outer query, using values from that specific row.

Regular vs Correlated Subqueries

Feature	Regular Subquery	Correlated Subquery
Execution	Runs once, independently	Runs for each outer row
References	No outer query columns	References outer query columns
Performance	Usually faster	Can be slower (row-by-row)
Use Case	Fixed comparisons	Row-dependent comparisons

Understanding the Difference

Let's start with sample data:

employees

id	name	department	salary
1	Alice	Engineering	85000
2	Bob	Engineering	95000
3	Charlie	Sales	70000
4	Diana	Sales	65000
5	Eve	Marketing	75000
6	Frank	Marketing	80000

Regular subquery (runs once):

sql

-- Find employees earning more than company average
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

The subquery (SELECT AVG(salary) FROM employees) runs once, calculates 78,333.33, then the outer query compares each salary to that fixed value.

Correlated subquery (runs for each row):

sql

-- Find employees earning more than their department's average
SELECT name, department, salary
FROM employees e1
WHERE salary > (
  SELECT AVG(salary)
  FROM employees e2
  WHERE e2.department = e1.department
);

Notice e2.department = e1.department - the subquery references the outer query's e1.department. For each employee:

Alice (Engineering): Subquery calculates AVG for Engineering (90,000). 85,000 < 90,000 → NOT included
Bob (Engineering): AVG for Engineering (90,000). 95,000 > 90,000 → INCLUDED
Charlie (Sales): AVG for Sales (67,500). 70,000 > 67,500 → INCLUDED
Diana (Sales): AVG for Sales (67,500). 65,000 < 67,500 → NOT included
Eve (Marketing): AVG for Marketing (77,500). 75,000 < 77,500 → NOT included
Frank (Marketing): AVG for Marketing (77,500). 80,000 > 77,500 → INCLUDED

The subquery runs 6 times, once per employee, each time with a different department.

Key indicator: If you see the subquery using a column from the outer query (like e1.department), it's correlated!

Correlated Subquery Syntax

Structure:

sql

SELECT column1, column2
FROM table1 outer_alias
WHERE column > (
  SELECT aggregate_function(column)
  FROM table2 inner_alias
  WHERE inner_alias.column = outer_alias.column
);

Key components:

Outer query alias (outer_alias) - identifies rows from the outer query
Inner query alias (inner_alias) - identifies rows in the subquery
Correlation condition - inner_alias.column = outer_alias.column
The subquery references the outer alias

Important: You must use table aliases (e1, e2, etc.) to distinguish outer and inner references!

EXISTS with Correlated Subqueries

EXISTS checks if a correlated subquery returns any rows. It's extremely common with correlated subqueries.

orders

order_id	customer_id	product_id	quantity	order_date
1	101	1	2	2024-01-15
2	102	2	1	2024-01-20
3	101	3	5	2024-02-01
4	103	1	1	2024-02-05

customers

customer_id	name	email
101	Alice	alice@example.com
102	Bob	bob@example.com
103	Charlie	charlie@example.com
104	Diana	diana@example.com

Find customers who have placed orders:

sql

SELECT customer_id, name
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

For each customer, the subquery checks: "Are there any orders with this customer_id?" If yes, include the customer. Diana (104) has no orders, so she's excluded.

Find customers who have NOT placed orders:

sql

SELECT customer_id, name
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

NOT EXISTS is the opposite - it returns rows where the subquery finds no matches.

Pro tip: In EXISTS subqueries, SELECT 1 is conventional. The actual value doesn't matter - EXISTS only cares whether rows are returned, not what data they contain.

Common Patterns with Correlated Subqueries

Pattern 1: Find items above group average

products

product_id	product_name	category	price
1	Laptop	Electronics	999.99
2	Mouse	Electronics	29.99
3	Desk Chair	Furniture	199.99
4	Monitor	Electronics	349.99
5	Desk Lamp	Furniture	39.99
6	Keyboard	Electronics	79.99

sql

-- Products more expensive than their category average
SELECT product_name, category, price
FROM products p1
WHERE price > (
  SELECT AVG(price)
  FROM products p2
  WHERE p2.category = p1.category
);

Pattern 2: Find top N per group

sql

-- Employees who are in the top 2 highest paid in their department
SELECT name, department, salary
FROM employees e1
WHERE (
  SELECT COUNT(*)
  FROM employees e2
  WHERE e2.department = e1.department
    AND e2.salary > e1.salary
) < 2
ORDER BY department, salary DESC;

For each employee, this counts how many employees in the same department earn more. If fewer than 2 earn more, they're in the top 2!

Pattern 3: Find duplicates

orders_raw

order_id	customer_id	order_date	total
1	101	2024-02-01	150.00
2	102	2024-02-01	200.00
3	101	2024-02-01	150.00
4	103	2024-02-02	100.00

sql

-- Find duplicate orders (same customer, date, and total)
SELECT *
FROM orders_raw o1
WHERE EXISTS (
  SELECT 1
  FROM orders_raw o2
  WHERE o2.customer_id = o1.customer_id
    AND o2.order_date = o1.order_date
    AND o2.total = o1.total
    AND o2.order_id != o1.order_id
);

Orders 1 and 3 have identical customer, date, and total values - likely duplicates!

o2.order_id != o1.order_id ensures we don't match a row with itself.

Scalar Correlated Subqueries

Correlated subqueries can appear in the SELECT clause, returning a value for each row:

sql

-- Show each employee with their department's average salary
SELECT
  name,
  department,
  salary,
  (SELECT AVG(salary)
   FROM employees e2
   WHERE e2.department = e1.department) AS dept_avg_salary,
  salary - (SELECT AVG(salary)
            FROM employees e2
            WHERE e2.department = e1.department) AS difference_from_avg
FROM employees e1
ORDER BY department, salary DESC;

Each row calculates its department's average on-the-fly. This makes it easy to see who's above or below their department average!

Correlated Subqueries with IN

You can use IN with correlated subqueries for complex filtering:

sql

-- Find customers who ordered product 1
SELECT customer_id, name
FROM customers c
WHERE customer_id IN (
  SELECT customer_id
  FROM orders
  WHERE product_id = 1
);

This isn't technically correlated (the subquery doesn't reference c), but let's make it correlated:

sql

-- Find products that were ordered by customer 101
SELECT product_id, product_name
FROM products p
WHERE product_id IN (
  SELECT product_id
  FROM orders o
  WHERE o.customer_id = 101
);

Performance Considerations

Correlated subqueries run once per outer row, which can be slow on large datasets.

Performance impact: If the outer query returns 10,000 rows, the correlated subquery runs 10,000 times! For large tables, this can be very slow.

Optimization strategies:

1. Use JOINs instead (when possible):

sql

-- Correlated subquery (slower)
SELECT name, salary
FROM employees e1
WHERE salary > (
  SELECT AVG(salary)
  FROM employees e2
  WHERE e2.department = e1.department
);

-- JOIN with GROUP BY (faster)
SELECT e.name, e.salary
FROM employees e
JOIN (
  SELECT department, AVG(salary) AS avg_salary
  FROM employees
  GROUP BY department
) dept_avg ON e.department = dept_avg.department
WHERE e.salary > dept_avg.avg_salary;

The JOIN version calculates each department's average once, then joins. Much faster!

2. Use window functions (covered later):

sql

-- Even better: window functions
SELECT name, salary
FROM (
  SELECT
    name,
    salary,
    department,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg
  FROM employees
) subquery
WHERE salary > dept_avg;

Window functions calculate the average in a single pass - the most efficient approach!

3. Add indexes on correlated columns:

If you must use correlated subqueries, index the columns used in the correlation:

sql

-- Index the department column for faster correlation
CREATE INDEX idx_employees_department ON employees(department);

Best practice: Use correlated subqueries for readability when working with small datasets. For large datasets or production systems, prefer JOINs or window functions.

Common Mistakes to Avoid

Don't make these mistakes:

Forgetting table aliases: ❌
- Wrong: WHERE department = department
- This compares the column to itself, not outer to inner
- Fix: Use aliases: WHERE e2.department = e1.department
Using aggregate without GROUP BY in subquery: Usually OK
- Correlated subqueries with aggregates don't need GROUP BY
- The correlation (WHERE clause) effectively groups the data
Confusing EXISTS with IN: Different uses
- EXISTS checks if any rows match (boolean)
- IN checks if a value is in a set
- EXISTS is often faster for correlated checks
Over-using correlated subqueries: ❌
- They're powerful but can be slow
- Check if a JOIN or window function works instead
Not testing on large datasets: ❌
- Correlated subquery works fine on 100 rows
- Becomes unusably slow on 1,000,000 rows
- Always test performance!
Expecting subquery to return multiple columns: ❌
- Wrong: WHERE (col1, col2) = (SELECT col1, col2 FROM ...)
- In most contexts, subquery must return a single value
- Use EXISTS or IN for multi-column correlations
Missing the correlation condition: ❌
- Forgetting WHERE e2.dept = e1.dept makes it uncorrelated
- Results will be wrong!

Practical Examples

Example 1: Find products that have never been ordered

sql

SELECT product_id, product_name
FROM products p
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.product_id = p.product_id
);

Example 2: Find employees hired before anyone in their department

employees_with_dates

id	name	department	hire_date
1	Alice	Engineering	2020-01-15
2	Bob	Engineering	2021-03-20
3	Charlie	Sales	2019-06-10
4	Diana	Sales	2022-02-01
5	Eve	Marketing	2020-11-05

sql

-- Find the most senior employee in each department
SELECT name, department, hire_date
FROM employees_with_dates e1
WHERE NOT EXISTS (
  SELECT 1
  FROM employees_with_dates e2
  WHERE e2.department = e1.department
    AND e2.hire_date < e1.hire_date
);

For each employee, we check: "Is there anyone in my department hired earlier?" If no, they're the most senior!

Example 3: Customers with above-average order frequency

sql

-- Customers who order more frequently than average
SELECT
  c.customer_id,
  c.name,
  (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count
FROM customers c
WHERE (
  SELECT COUNT(*)
  FROM orders o
  WHERE o.customer_id = c.customer_id
) > (
  SELECT AVG(order_count)
  FROM (
    SELECT COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
  ) AS counts
);

Alice has 2 orders, Bob and Charlie have 1 each. Average = (2+1+1)/3 = 1.33. Alice is above average!

Key Takeaways

What you learned:

✅ Correlated subqueries reference columns from the outer query ✅ They execute once per row of the outer query ✅ EXISTS and NOT EXISTS are common with correlated subqueries ✅ Use table aliases to distinguish outer and inner queries ✅ Correlated subqueries enable row-by-row comparisons (e.g., comparing to group averages) ✅ They're powerful but can be slow on large datasets ✅ JOINs and window functions are often faster alternatives ✅ Common patterns: above group average, top N per group, finding duplicates, checking existence

When to use correlated subqueries:

Comparing each row to its group's statistics
Checking for existence (EXISTS/NOT EXISTS)
Finding top N per category
Detecting duplicates
When readability matters more than performance (small datasets)

When to avoid:

Large datasets (millions of rows)
When a JOIN or window function can do the job
Production queries needing optimal performance

Coming up next: You'll learn about set operations (UNION, INTERSECT, EXCEPT) to combine and compare entire result sets!

Practice Exercise: Try these queries:

Find products more expensive than their category average
List customers who have placed more orders than the average customer
Find employees hired before anyone else in their department
Identify customers who have ordered every product
Find products that have been ordered but never returned

Introduction to Correlated SubqueriesFocusStart Focus Mode

Understanding the DifferenceFocusStart Focus Mode

Correlated Subquery SyntaxFocusStart Focus Mode

EXISTS with Correlated SubqueriesFocusStart Focus Mode

Common Patterns with Correlated SubqueriesFocusStart Focus Mode

Scalar Correlated SubqueriesFocusStart Focus Mode

Correlated Subqueries with INFocusStart Focus Mode

Performance ConsiderationsFocusStart Focus Mode

Common Mistakes to AvoidFocusStart Focus Mode

Practical ExamplesFocusStart Focus Mode

Key TakeawaysFocusStart Focus Mode