Kekkei - Advancing Education For Everyone

Window Functions

Window functions are one of the most powerful features in SQL for performing calculations across sets of rows that are related to the current row. Unlike aggregate functions that collapse multiple rows into one, window functions preserve the individual rows while adding calculated values.

Think of window functions as looking through a "window" at a subset of your data while processing each row. You can rank rows, calculate running totals, find moving averages, compare values across rows, and much more - all without losing the detail of individual records.

Why Use Window Functions?

Analytics without grouping: Calculate aggregates while keeping all rows visible
Ranking and numbering: Assign ranks, row numbers, or dense ranks to rows
Comparisons across rows: Compare current row with previous/next rows
Running calculations: Calculate running totals, moving averages, cumulative sums
Partitioning: Perform calculations within subgroups without GROUP BY
Performance: More efficient than self-joins for many analytical queries

Window functions are essential for data analysis, reporting, and business intelligence. They're supported by all modern databases (MySQL 8.0+, PostgreSQL, SQL Server, Oracle, SQLite 3.25+).

Window functions were introduced in MySQL 8.0. If you're using MySQL 5.7 or earlier, you'll need to use self-joins or variables to achieve similar results, which is much more complex.

Basic Window Function Syntax

Window functions use the OVER() clause to define the "window" of rows to operate on. The basic syntax is:

function_name() OVER (
  [PARTITION BY column1, column2, ...]
  [ORDER BY column3, column4, ...]
  [ROWS or RANGE frame_specification]
)

Key components:

Function: The window function (ROW_NUMBER, RANK, SUM, AVG, etc.)
PARTITION BY: Divides data into groups (optional)
ORDER BY: Defines the order of rows within each partition (required for many functions)
Frame clause: Defines which rows are included in the window (advanced, optional)

A function that performs calculations across a set of rows related to the current row, using the OVER() clause to define the window of rows to include.

ROW_NUMBER() - Sequential Numbering

ROW_NUMBER() assigns a unique sequential number to each row within a partition, starting from 1. It's one of the most commonly used window functions.

sql

-- Number all employees, ordered by salary (highest first)
SELECT
  employee_id,
  name,
  department,
  salary,
  ROW_NUMBER() OVER (ORDER BY salary DESC) AS overall_rank
FROM employees
ORDER BY salary DESC;

Now let's use PARTITION BY to number employees within each department:

sql

-- Number employees within each department by salary
SELECT
  employee_id,
  name,
  department,
  salary,
  ROW_NUMBER() OVER (
    PARTITION BY department
    ORDER BY salary DESC
  ) AS dept_rank
FROM employees
ORDER BY department, dept_rank;

Notice how PARTITION BY department restarts the numbering at 1 for each department. This is incredibly useful for finding the top N in each group.

Key points:

ROW_NUMBER() always assigns unique numbers, even for ties
PARTITION BY divides data into groups and restarts numbering for each group
ORDER BY determines which row gets which number
Without PARTITION BY, the entire result set is one window

Use ROW_NUMBER() when you need to uniquely identify each row, even if there are ties. For example, "get exactly the top 3 salaries per department" where ties should be broken arbitrarily.

RANK() and DENSE_RANK() - Handling Ties

While ROW_NUMBER() always assigns unique numbers, RANK() and DENSE_RANK() handle tied values differently:

RANK(): Assigns the same rank to ties, then skips numbers
DENSE_RANK(): Assigns the same rank to ties, but doesn't skip numbers
ROW_NUMBER(): Never assigns the same number, even for ties

Let's see them in action:

sql

-- Compare ROW_NUMBER, RANK, and DENSE_RANK
SELECT
  name,
  department,
  salary,
  ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
  RANK() OVER (ORDER BY salary DESC) AS rank_num,
  DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank_num
FROM employees
ORDER BY salary DESC, name;

Notice the difference for employees with $58,000 salary (Eve Martinez and Helen White):

ROW_NUMBER: Assigns 7 and 8 (unique numbers, arbitrary tie-breaking)
RANK: Both get 7, then next rank is 9 (skips 8)
DENSE_RANK: Both get 7, then next rank is 8 (no gaps)

Here's a practical example - finding the top 3 paid employees in each department:

sql

-- Find top 3 highest paid employees per department
WITH ranked_employees AS (
  SELECT
    employee_id,
    name,
    department,
    salary,
    DENSE_RANK() OVER (
      PARTITION BY department
      ORDER BY salary DESC
    ) AS salary_rank
  FROM employees
)
SELECT
  employee_id,
  name,
  department,
  salary,
  salary_rank
FROM ranked_employees
WHERE salary_rank <= 3
ORDER BY department, salary_rank, name;

Using DENSE_RANK() ensures that if there are ties for rank 2, we still get rank 3 results. With RANK(), tied rank 2 values would make the next rank 4, potentially missing our "top 3" goal.

Function	Handles Ties	Gaps in Sequence	Use Case
ROW_NUMBER()	Assigns unique numbers	No gaps	Need unique IDs, pagination
RANK()	Same rank for ties	Gaps after ties	Olympic-style ranking (1, 2, 2, 4)
DENSE_RANK()	Same rank for ties	No gaps	Find "top N distinct values"

When filtering on rank (e.g., WHERE rank <= 3), you MUST use a CTE or subquery. You can't filter on window functions directly in a WHERE clause because window functions are calculated after WHERE.

Aggregate Window Functions

You can use aggregate functions (SUM, AVG, COUNT, MIN, MAX) as window functions to calculate running totals, moving averages, and cumulative statistics while preserving individual rows.

sql

-- Calculate running total and moving average of sales
SELECT
  order_date,
  daily_revenue,
  SUM(daily_revenue) OVER (
    ORDER BY order_date
  ) AS running_total,
  AVG(daily_revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_avg_3day,
  COUNT(*) OVER (ORDER BY order_date) AS days_count
FROM (
  SELECT
    DATE(order_date) AS order_date,
    SUM(total_amount) AS daily_revenue
  FROM orders
  WHERE order_date >= '2026-01-01' AND order_date <= '2026-01-07'
  GROUP BY DATE(order_date)
) daily_sales
ORDER BY order_date;

This query demonstrates:

Running total: SUM() OVER (ORDER BY order_date) adds up all previous rows
Moving average: AVG() OVER (... ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) calculates 3-day average
Row count: COUNT(*) OVER (ORDER BY order_date) counts rows up to current row

The ROWS BETWEEN clause defines the window frame - which specific rows to include:

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: Current row + 2 rows before it
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: All rows from start to current (default for ORDER BY)
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: Previous row + current + next row

Here's an example with partitioning to calculate department-level statistics:

sql

-- Calculate salary statistics within each department
SELECT
  employee_id,
  name,
  department,
  salary,
  AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary,
  MAX(salary) OVER (PARTITION BY department) AS dept_max_salary,
  MIN(salary) OVER (PARTITION BY department) AS dept_min_salary,
  COUNT(*) OVER (PARTITION BY department) AS dept_employee_count,
  ROUND(salary / AVG(salary) OVER (PARTITION BY department) * 100, 1) AS pct_of_avg
FROM employees
ORDER BY department, salary DESC;

Each row shows individual employee data alongside department-level aggregates. This would be impossible with GROUP BY without losing the individual row details.

Window aggregate functions are perfect for calculating "percentage of total" or "difference from average" because you can see both the individual value and the aggregate in the same row.

LAG() and LEAD() - Access Adjacent Rows

LAG() and LEAD() allow you to access data from previous or following rows without self-joins. This is incredibly useful for calculating changes, growth rates, or comparing consecutive periods.

LAG(column, offset, default): Gets value from previous row(s)
LEAD(column, offset, default): Gets value from next row(s)

sql

-- Calculate month-over-month sales growth
SELECT
  month,
  total_sales,
  LAG(total_sales, 1) OVER (ORDER BY month) AS prev_month_sales,
  total_sales - LAG(total_sales, 1) OVER (ORDER BY month) AS sales_change,
  ROUND(
    100.0 * (total_sales - LAG(total_sales, 1) OVER (ORDER BY month))
    / LAG(total_sales, 1) OVER (ORDER BY month),
    1
  ) AS growth_pct
FROM (
  SELECT
    DATE_FORMAT(order_date, '%Y-%m') AS month,
    SUM(total_amount) AS total_sales
  FROM orders
  WHERE order_date >= '2025-07-01' AND order_date < '2026-02-01'
  GROUP BY DATE_FORMAT(order_date, '%Y-%m')
) monthly_sales
ORDER BY month;

Notice the first row has NULL for previous month (no previous row exists). You can provide a default value as the third parameter: LAG(total_sales, 1, 0).

Here's an example using LEAD() to compare with the next row:

sql

-- Compare each employee's salary with the next higher salary
SELECT
  name,
  department,
  salary,
  LEAD(salary, 1) OVER (
    PARTITION BY department
    ORDER BY salary DESC
  ) AS next_lower_salary,
  salary - LEAD(salary, 1) OVER (
    PARTITION BY department
    ORDER BY salary DESC
  ) AS salary_gap
FROM employees
ORDER BY department, salary DESC;

The offset parameter (second argument) lets you look further ahead or behind:

LAG(salary, 1): Previous row
LAG(salary, 2): Two rows back
LEAD(salary, 1): Next row
LEAD(salary, 2): Two rows ahead

LAG() and LEAD() are much more efficient and readable than self-joins for comparing adjacent rows. They're perfect for time-series analysis, calculating deltas, and finding sequential patterns.

NTILE() - Distribution into Buckets

NTILE(n) divides rows into n roughly equal buckets and assigns a bucket number (1 to n) to each row. This is useful for quartile analysis, percentile groupings, or dividing data into equal segments.

sql

-- Divide employees into salary quartiles
SELECT
  employee_id,
  name,
  department,
  salary,
  NTILE(4) OVER (ORDER BY salary DESC) AS salary_quartile,
  CASE
    WHEN NTILE(4) OVER (ORDER BY salary DESC) = 1 THEN 'Top 25%'
    WHEN NTILE(4) OVER (ORDER BY salary DESC) = 2 THEN '25-50%'
    WHEN NTILE(4) OVER (ORDER BY salary DESC) = 3 THEN '50-75%'
    ELSE 'Bottom 25%'
  END AS salary_bracket
FROM employees
ORDER BY salary DESC;

With 9 employees and 4 quartiles, NTILE distributes them as evenly as possible: 2, 2, 2, 3 (the remainder goes to the last buckets).

Here's a practical use case - dividing customers into value segments:

sql

-- Segment customers by lifetime value into 3 tiers
WITH customer_value AS (
  SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) AS total_orders,
    SUM(o.total_amount) AS lifetime_value
  FROM customers c
  LEFT JOIN orders o ON c.customer_id = o.customer_id
  GROUP BY c.customer_id, c.customer_name
)
SELECT
  customer_id,
  customer_name,
  total_orders,
  ROUND(lifetime_value, 2) AS lifetime_value,
  NTILE(3) OVER (ORDER BY lifetime_value DESC) AS value_tier,
  CASE
    WHEN NTILE(3) OVER (ORDER BY lifetime_value DESC) = 1 THEN 'High Value'
    WHEN NTILE(3) OVER (ORDER BY lifetime_value DESC) = 2 THEN 'Medium Value'
    ELSE 'Low Value'
  END AS customer_segment
FROM customer_value
WHERE lifetime_value > 0
ORDER BY lifetime_value DESC;

NTILE() is perfect for segmentation, A/B testing group assignment, and percentile analysis.

Real-World Example: Sales Performance Dashboard

Let's combine multiple window functions to create a comprehensive sales performance analysis:

sql

-- Complete sales rep performance analysis
WITH sales_data AS (
  SELECT
    e.employee_id,
    e.name AS sales_rep,
    e.hire_date,
    COUNT(o.order_id) AS total_orders,
    COALESCE(SUM(o.total_amount), 0) AS total_sales,
    COALESCE(AVG(o.total_amount), 0) AS avg_order_value
  FROM employees e
  LEFT JOIN orders o ON e.employee_id = o.sales_rep_id
  WHERE e.department = 'Sales'
  GROUP BY e.employee_id, e.name, e.hire_date
)
SELECT
  sales_rep,
  total_orders,
  ROUND(total_sales, 2) AS total_sales,
  ROUND(avg_order_value, 2) AS avg_order_value,
  -- Ranking functions
  RANK() OVER (ORDER BY total_sales DESC) AS sales_rank,
  DENSE_RANK() OVER (ORDER BY total_orders DESC) AS orders_rank,
  -- Percentile segmentation
  NTILE(3) OVER (ORDER BY total_sales DESC) AS performance_tier,
  -- Comparison with team
  ROUND(AVG(total_sales) OVER (), 2) AS team_avg_sales,
  ROUND(total_sales - AVG(total_sales) OVER (), 2) AS vs_team_avg,
  ROUND(100.0 * total_sales / SUM(total_sales) OVER (), 1) AS pct_of_total_sales,
  -- Sequential comparison
  LAG(total_sales, 1) OVER (ORDER BY total_sales DESC) AS next_higher_sales,
  ROUND(
    100.0 * total_sales / LAG(total_sales, 1) OVER (ORDER BY total_sales DESC),
    1
  ) AS pct_of_next_higher
FROM sales_data
ORDER BY total_sales DESC;

This single query provides:

Rankings: Sales rank and orders rank to identify top performers
Tiers: Performance tier (1 = top third) for segmentation
Team comparisons: Average sales and difference from average
Contribution: Percentage of total team sales
Sequential comparison: How close each rep is to the next higher performer

All of this in one query, with each row showing individual and comparative metrics!

Common Mistakes

1. Filtering on window functions in WHERE clause

❌ Wrong:

SELECT name, salary,
  RANK() OVER (ORDER BY salary DESC) AS rank
FROM employees
WHERE rank <= 3;  -- Error! Can't use window function in WHERE

✅ Correct:

WITH ranked AS (
  SELECT name, salary,
    RANK() OVER (ORDER BY salary DESC) AS rank
  FROM employees
)
SELECT * FROM ranked WHERE rank <= 3;

2. Forgetting ORDER BY in ranking functions

❌ Wrong:

SELECT name, salary,
  RANK() OVER (PARTITION BY department) AS rank  -- Missing ORDER BY!
FROM employees;

✅ Correct:

SELECT name, salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM employees;

3. Using GROUP BY with window functions unnecessarily

Window functions work on individual rows, not aggregated groups. If you use GROUP BY, the window function operates on the grouped results.

❌ Confusing:

SELECT department, AVG(salary) AS avg_salary,
  RANK() OVER (ORDER BY AVG(salary) DESC) AS rank
FROM employees
GROUP BY department;  -- GROUP BY changes what rows window function sees

✅ Clearer (if you want individual rows):

SELECT name, department, salary,
  AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;

4. Rewriting the same window definition multiple times

You can use the WINDOW clause to name a window definition and reuse it:

❌ Repetitive:

SELECT name,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC),
  ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)
FROM employees;

✅ Better:

SELECT name,
  RANK() OVER w,
  ROW_NUMBER() OVER w
FROM employees
WINDOW w AS (PARTITION BY department ORDER BY salary DESC);

Key Takeaways

Window functions use OVER() to define the window of rows to operate on
PARTITION BY divides data into groups; calculations restart for each group
ORDER BY within OVER() determines row order for the calculation
ROW_NUMBER() assigns unique sequential numbers (1, 2, 3, 4...)
RANK() handles ties with gaps (1, 2, 2, 4, 5...)
DENSE_RANK() handles ties without gaps (1, 2, 2, 3, 4...)
Aggregate functions (SUM, AVG, etc.) as window functions create running calculations
LAG() and LEAD() access previous or next rows without self-joins
NTILE(n) divides rows into n buckets for quartile/percentile analysis
Window functions preserve individual rows unlike GROUP BY
Filter window function results with CTEs, not WHERE clauses
Window functions are essential for analytics, reporting, and business intelligence

Master window functions and you'll write analytics queries that are 10x more efficient and readable than complex self-joins or correlated subqueries. They're the secret weapon of SQL data analysts.

Practice Exercise

Using the following tables:

orders (order_id, customer_id, order_date, total_amount, sales_rep_id)
customers (customer_id, customer_name, signup_date, customer_tier)
employees (employee_id, name, department)

Write queries using window functions to:

Rank customers by total spending within each customer tier
Calculate each customer's running total of spending ordered by date
Find the top 3 sales reps by total revenue, showing how far behind each is from the #1 rep
Calculate month-over-month growth percentage for total sales
Divide all orders into 10 equal buckets by order value and calculate the average order value for each bucket

Bonus challenge: Create a single query that shows each customer's most recent order, previous order amount, and how many days between those two orders using LAG().

Window FunctionsFocusStart Focus Mode

Basic Window Function SyntaxFocusStart Focus Mode

ROW_NUMBER() - Sequential NumberingFocusStart Focus Mode

RANK() and DENSE_RANK() - Handling TiesFocusStart Focus Mode

Aggregate Window FunctionsFocusStart Focus Mode

LAG() and LEAD() - Access Adjacent RowsFocusStart Focus Mode

NTILE() - Distribution into BucketsFocusStart Focus Mode

Real-World Example: Sales Performance DashboardFocusStart Focus Mode

Common MistakesFocusStart Focus Mode

Key TakeawaysFocusStart Focus Mode

Practice ExerciseFocusStart Focus Mode