Kekkei - Advancing Education For Everyone

Introduction to PARTITION BY

You've learned about window functions with the OVER() clause. Now let's dive deep into PARTITION BY - the most powerful aspect of window functions that lets you divide your data into groups and perform calculations within each group without losing individual rows.

Think of PARTITION BY like GROUP BY's sophisticated cousin:

GROUP BY: Collapses rows into groups, returns one row per group
PARTITION BY: Divides rows into groups, keeps all rows, adds calculated values

Real-world scenario: "Show each employee with their salary and their department's average salary side-by-side."

With GROUP BY, you'd lose individual employee rows. With PARTITION BY, you keep every employee row and add the department average as a new column!

A clause in window functions that divides the result set into partitions (groups) and applies the window function separately to each partition, while preserving all individual rows.

GROUP BY vs PARTITION BY

Feature	GROUP BY	PARTITION BY
Purpose	Aggregate rows into groups	Calculate within groups
Result	One row per group	All rows preserved
Use Case	"Total sales by region"	"Each sale with regional average"
Combines	Must use aggregate functions	Works with any window function

Understanding the Difference

Let's start with sample data to see the difference clearly:

employees

id	name	department	salary
1	Alice	Engineering	95000
2	Bob	Engineering	85000
3	Charlie	Sales	70000
4	Diana	Sales	65000
5	Eve	Marketing	75000
6	Frank	Marketing	80000

With GROUP BY (collapses rows):

sql

-- Average salary per department (one row per department)
SELECT
  department,
  AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

We get 3 rows (one per department), but we lost individual employee data!

With PARTITION BY (keeps all rows):

sql

-- Each employee with their department's average salary
SELECT
  name,
  department,
  salary,
  AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary
FROM employees
ORDER BY department, salary DESC;

We get all 6 employees plus each department's average salary! Each employee can see how their salary compares to their department's average.

This is the magic of PARTITION BY - you get group-level statistics on individual row-level data.

Key difference:

GROUP BY: 6 rows → 3 rows (collapsed)
PARTITION BY: 6 rows → 6 rows (preserved) with added calculation

PARTITION BY Syntax

Basic structure:

sql

SELECT
  column1,
  column2,
  FUNCTION() OVER (PARTITION BY partition_column) AS result_name
FROM table_name;

With multiple partition columns:

sql

SELECT
  column1,
  FUNCTION() OVER (PARTITION BY col1, col2 ORDER BY col3) AS result
FROM table_name;

Components:

PARTITION BY column: Divides data into groups based on column values
ORDER BY: (Optional) Orders rows within each partition
FUNCTION(): Any window function (AVG, SUM, ROW_NUMBER, RANK, etc.)

Aggregate Functions with PARTITION BY

All aggregate functions work with PARTITION BY: SUM, AVG, COUNT, MIN, MAX.

sales

sale_id	salesperson	region	amount	sale_date
1	Alice	North	5000	2024-01-15
2	Bob	North	7000	2024-01-20
3	Charlie	South	6000	2024-01-18
4	Alice	North	4500	2024-02-01
5	Diana	South	5500	2024-02-05
6	Bob	North	8000	2024-02-10

Example 1: SUM - Total sales by region

sql

SELECT
  sale_id,
  salesperson,
  region,
  amount,
  SUM(amount) OVER (PARTITION BY region) AS regional_total
FROM sales
ORDER BY region, sale_date;

Each sale shows the total for its region. North: 5000+7000+4500+8000 = 24,500. South: 6000+5500 = 11,500.

Example 2: COUNT - Number of sales per region

sql

SELECT
  sale_id,
  salesperson,
  region,
  amount,
  COUNT(*) OVER (PARTITION BY region) AS sales_count
FROM sales
ORDER BY region, sale_date;

North has 4 sales, South has 2.

Example 3: MIN and MAX - Range within partition

sql

SELECT
  salesperson,
  region,
  amount,
  MIN(amount) OVER (PARTITION BY region) AS min_regional_sale,
  MAX(amount) OVER (PARTITION BY region) AS max_regional_sale,
  amount - MIN(amount) OVER (PARTITION BY region) AS above_minimum
FROM sales
ORDER BY region, amount DESC;

Each sale shows its region's minimum and maximum, plus how much it exceeds the minimum.

Ranking Functions with PARTITION BY

Ranking functions (ROW_NUMBER, RANK, DENSE_RANK) are incredibly useful with PARTITION BY.

ROW_NUMBER - Sequential numbering within partition:

sql

-- Rank salespeople within each region by sales amount
SELECT
  salesperson,
  region,
  amount,
  ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS regional_rank
FROM sales
ORDER BY region, regional_rank;

Ranking restarts at 1 for each region! Bob's 8000 is #1 in North, Charlie's 6000 is #1 in South.

RANK - With gap for ties:

products

product_id	category	product_name	price
1	Electronics	Laptop	999.99
2	Electronics	Monitor	349.99
3	Electronics	Keyboard	79.99
4	Furniture	Desk	299.99
5	Furniture	Chair	199.99
6	Furniture	Lamp	199.99

sql

-- Rank products by price within category
SELECT
  category,
  product_name,
  price,
  RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
FROM products;

Chair and Lamp are tied at rank 2 (both $199.99). Notice there's no rank 3 in Furniture - RANK leaves gaps after ties.

Finding Top N per Group

One of the most common use cases: "Find the top 3 products in each category" or "Top 2 salespeople per region".

sql

-- Top 2 sales per region
SELECT *
FROM (
  SELECT
    sale_id,
    salesperson,
    region,
    amount,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
  FROM sales
) ranked
WHERE rank_in_region <= 2
ORDER BY region, rank_in_region;

We get the top 2 sales from each region! This would be extremely complex with GROUP BY alone.

Top 2 most expensive products per category:

sql

SELECT category, product_name, price
FROM (
  SELECT
    category,
    product_name,
    price,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) AS rank
  FROM products
) ranked
WHERE rank <= 2;

Pattern for "Top N per group":

Use ROW_NUMBER() OVER (PARTITION BY group_column ORDER BY sort_column)
Wrap in subquery
Filter WHERE rank <= N

Multiple Partitions in One Query

You can use multiple window functions with different partitions:

sql

SELECT
  salesperson,
  region,
  amount,
  -- Regional statistics
  AVG(amount) OVER (PARTITION BY region) AS regional_avg,
  RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS regional_rank,
  -- Salesperson statistics (across all regions)
  SUM(amount) OVER (PARTITION BY salesperson) AS salesperson_total,
  -- Overall statistics (no partition)
  AVG(amount) OVER () AS overall_avg
FROM sales
ORDER BY region, amount DESC;

This single query calculates:

Regional average and rank
Each salesperson's total (all their sales combined)
Overall company average

Each window function can have its own PARTITION BY!

PARTITION BY with ORDER BY

Adding ORDER BY to the OVER clause changes behavior - it creates a "running" or "cumulative" calculation.

Without ORDER BY (calculates over entire partition):

sql

-- Total sales per region (all rows in partition)
SELECT
  sale_id,
  region,
  amount,
  SUM(amount) OVER (PARTITION BY region) AS regional_total
FROM sales
ORDER BY region, sale_id;

Every North row shows 24,500 (total of all North sales).

With ORDER BY (running total within partition):

sql

-- Running total per region (cumulative)
SELECT
  sale_id,
  region,
  sale_date,
  amount,
  SUM(amount) OVER (PARTITION BY region ORDER BY sale_date) AS running_total
FROM sales
ORDER BY region, sale_date;

Now it's a running total! Each row shows the sum of all sales in that region up to that date.

For North:

Jan 15: 5000 (first sale)
Jan 20: 12000 (5000 + 7000)
Feb 01: 16500 (5000 + 7000 + 4500)
Feb 10: 24500 (5000 + 7000 + 4500 + 8000)

Key difference:

No ORDER BY: Function applies to entire partition
With ORDER BY: Function applies to rows from start of partition up to current row (running calculation)

Practical Business Examples

Example 1: Employee comparison to department average

sql

SELECT
  name,
  department,
  salary,
  ROUND(AVG(salary) OVER (PARTITION BY department), 2) AS dept_avg,
  ROUND(salary - AVG(salary) OVER (PARTITION BY department), 2) AS diff_from_avg,
  CASE
    WHEN salary > AVG(salary) OVER (PARTITION BY department)
    THEN 'Above Average'
    ELSE 'Below Average'
  END AS performance_level
FROM employees
ORDER BY department, salary DESC;

Example 2: Market share analysis

sql

SELECT
  salesperson,
  region,
  amount,
  SUM(amount) OVER (PARTITION BY region) AS regional_total,
  ROUND(amount * 100.0 / SUM(amount) OVER (PARTITION BY region), 2) AS regional_share_pct,
  SUM(amount) OVER () AS company_total,
  ROUND(amount * 100.0 / SUM(amount) OVER (), 2) AS company_share_pct
FROM sales
ORDER BY region, amount DESC;

Bob's 8000 sale represents 32.65% of North region sales and 22.22% of company sales.

Example 3: Moving average per product category

sql

SELECT
  category,
  product_name,
  price,
  AVG(price) OVER (PARTITION BY category) AS category_avg_price,
  price / AVG(price) OVER (PARTITION BY category) AS price_index
FROM products
ORDER BY category, price DESC;

The Laptop costs 2.10x the Electronics category average. The Keyboard is only 0.17x (17% of) the average.

Common Mistakes to Avoid

Don't make these mistakes:

Forgetting ORDER BY for ranking functions: ❌
- Wrong: ROW_NUMBER() OVER (PARTITION BY dept)
- ROW_NUMBER needs ORDER BY to know which order to number rows
- Fix: ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)
Using GROUP BY when you mean PARTITION BY: ❌
- GROUP BY collapses rows, PARTITION BY preserves them
- If you want individual rows with group stats, use PARTITION BY
Mixing aggregate without window function: ❌
- Wrong: SELECT name, salary, AVG(salary) FROM employees
- Either GROUP BY or use OVER clause
- Fix: SELECT name, salary, AVG(salary) OVER () FROM employees
Not understanding ORDER BY's effect: ⚠️
- ORDER BY in OVER changes behavior to running calculations
- Without ORDER BY: entire partition
- With ORDER BY: running/cumulative
Assuming partitions are independent: They are!
- Calculations in one partition don't affect others
- Ranks restart at 1 in each partition
Over-partitioning: ❌
- Partitioning by unique column (like ID) is pointless
- Each partition would have only 1 row
- Partition by meaningful groups (department, region, category)
Not aliasing window functions: Makes queries unreadable
- Always use AS to name your calculated columns
- AVG(salary) OVER (...) AS dept_avg_salary
Forgetting PARTITION BY creates running total with ORDER BY: ⚠️
- SUM(amount) OVER (ORDER BY date) is a running total
- Add PARTITION BY if you want running totals per group

Performance Considerations

PARTITION BY performance tips:

1. Index the partition columns:

sql

-- If you frequently partition by department and order by salary
CREATE INDEX idx_dept_salary ON employees(department, salary);

2. Limit the result set before applying window functions:

sql

-- Good: Filter first, then apply window function
SELECT name, department, salary,
       AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees
WHERE hire_date >= '2023-01-01';

-- Less efficient: Window function on all rows, then filter
SELECT * FROM (
  SELECT name, department, salary,
         AVG(salary) OVER (PARTITION BY department) AS dept_avg
  FROM employees
) subquery
WHERE hire_date >= '2023-01-01';

3. Window functions vs self-joins:

Window functions are usually faster than self-joins for analytical queries:

sql

-- Slow: Self-join to get department average
SELECT e1.name, e1.salary, dept_avg.avg_salary
FROM employees e1
JOIN (
  SELECT department, AVG(salary) AS avg_salary
  FROM employees
  GROUP BY department
) dept_avg ON e1.department = dept_avg.department;

-- Fast: Window function
SELECT name, salary,
       AVG(salary) OVER (PARTITION BY department) AS avg_salary
FROM employees;

Best practice: Window functions are optimized for analytical queries. They're almost always faster and more readable than equivalent self-joins.

Key Takeaways

What you learned:

✅ PARTITION BY divides data into groups for window function calculations ✅ Unlike GROUP BY, PARTITION BY preserves all rows ✅ Window functions calculate within each partition separately ✅ Combine with aggregate functions (SUM, AVG, COUNT, MIN, MAX) ✅ Combine with ranking functions (ROW_NUMBER, RANK, DENSE_RANK) ✅ ORDER BY in OVER clause creates running/cumulative calculations ✅ Multiple window functions can use different partitions in same query ✅ Perfect for "top N per group" queries ✅ More efficient than self-joins for analytical queries ✅ Essential for business analytics and reporting

Common patterns:

Compare individual rows to group averages
Rank items within categories
Calculate running totals per group
Find top N per group
Calculate market share percentages
Identify above/below average performers

Coming up next: You'll learn about ACID properties - the fundamental principles that ensure database reliability and data integrity!

Practice Exercise: Try these queries:

Show each employee with their department's total salary budget
Rank products by price within each category
Calculate running total of sales per region ordered by date
Find the top 3 highest-paid employees in each department
Show each sale with its percentage of regional total

Introduction to PARTITION BYFocusStart Focus Mode

Understanding the DifferenceFocusStart Focus Mode

PARTITION BY SyntaxFocusStart Focus Mode

Aggregate Functions with PARTITION BYFocusStart Focus Mode

Ranking Functions with PARTITION BYFocusStart Focus Mode

Finding Top N per GroupFocusStart Focus Mode

Multiple Partitions in One QueryFocusStart Focus Mode

PARTITION BY with ORDER BYFocusStart Focus Mode

Practical Business ExamplesFocusStart Focus Mode

Common Mistakes to AvoidFocusStart Focus Mode

Performance ConsiderationsFocusStart Focus Mode

Key TakeawaysFocusStart Focus Mode