SQL for Analytics

Write production-quality analytical queries.

18 min read
Beginner

SQL for Analytics

SQL is one of the most powerful tools for data analytics, allowing you to extract insights from large datasets, calculate business metrics, and answer critical questions about your data.

In this lesson, we'll explore how SQL is used in real-world analytics scenarios, from calculating key performance indicators (KPIs) to building complex analytical queries that power business intelligence dashboards.

Why SQL for Analytics?

SQL excels at analytics because it:

  • Handles large volumes of data efficiently
  • Provides powerful aggregation and grouping capabilities
  • Supports complex calculations and transformations
  • Integrates seamlessly with BI tools like Tableau, Power BI, and Looker
  • Enables reproducible and version-controlled analysis

Understanding Business Metrics

Quantifiable measures used to evaluate the success of an organization, business unit, or specific activity. SQL is commonly used to calculate and track KPIs over time.

Common business metrics calculated with SQL include:

  • Revenue metrics: Total revenue, average order value, revenue growth rate
  • Customer metrics: Customer acquisition cost, customer lifetime value, churn rate
  • Product metrics: Conversion rate, cart abandonment rate, product popularity
  • Operational metrics: Average response time, inventory turnover, fulfillment rate

Example: Calculating Monthly Revenue

Let's calculate total revenue by month from an orders table:

sql
SELECT
  DATE_TRUNC('month', order_date) AS month,
  COUNT(*) AS total_orders,
  SUM(total_amount) AS total_revenue,
  AVG(total_amount) AS avg_order_value,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
The DATE_TRUNC function is database-specific. In MySQL, use DATE_FORMAT(order_date, '%Y-%m-01'). In SQL Server, use DATEADD(month, DATEDIFF(month, 0, order_date), 0).

Cohort Analysis

A technique that groups customers or users based on a shared characteristic or experience within a defined time period, then tracks their behavior over time.

Cohort analysis is essential for understanding customer retention and lifetime value. The most common type is a time-based cohort, where users are grouped by their signup or first purchase date.

Example: Customer Retention Cohort

This query shows how many customers from each monthly cohort made purchases in subsequent months:

sql
WITH customer_cohorts AS (
  SELECT
    customer_id,
    DATE_TRUNC('month', MIN(order_date)) AS cohort_month
  FROM orders
  GROUP BY customer_id
),
cohort_activity AS (
  SELECT
    c.cohort_month,
    DATE_TRUNC('month', o.order_date) AS activity_month,
    COUNT(DISTINCT o.customer_id) AS active_customers
  FROM customer_cohorts c
  JOIN orders o ON c.customer_id = o.customer_id
  GROUP BY c.cohort_month, DATE_TRUNC('month', o.order_date)
)
SELECT
  cohort_month,
  activity_month,
  active_customers,
  ROUND(100.0 * active_customers /
    FIRST_VALUE(active_customers) OVER (
      PARTITION BY cohort_month
      ORDER BY activity_month
    ), 2) AS retention_rate
FROM cohort_activity
ORDER BY cohort_month, activity_month;

This analysis shows that customers who joined in January 2025 had a 59.94% retention rate in month 2 and 50% in month 3.

Time-Series Analysis

Time-series analysis involves examining data points ordered by time to identify trends, patterns, and seasonality.

Example: Year-over-Year Growth

Calculate how revenue compares to the same period last year:

sql
WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS revenue
  FROM orders
  GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
  month,
  revenue AS current_revenue,
  LAG(revenue, 12) OVER (ORDER BY month) AS previous_year_revenue,
  ROUND(100.0 * (revenue - LAG(revenue, 12) OVER (ORDER BY month)) /
    LAG(revenue, 12) OVER (ORDER BY month), 2) AS yoy_growth_pct
FROM monthly_revenue
ORDER BY month DESC
LIMIT 6;
The LAG function with offset 12 retrieves the value from 12 months ago, perfect for year-over-year comparisons. Use offset 3 for quarter-over-quarter or offset 1 for month-over-month.

Funnel Analysis

Funnel analysis tracks how users progress through a series of steps toward a goal, identifying where drop-offs occur.

Example: E-commerce Purchase Funnel

Track the conversion funnel from product view to purchase:

sql
WITH funnel_steps AS (
  SELECT
    'Product View' AS step,
    1 AS step_order,
    COUNT(DISTINCT user_id) AS users
  FROM product_views
  WHERE view_date >= '2025-03-01'

  UNION ALL

  SELECT
    'Add to Cart' AS step,
    2 AS step_order,
    COUNT(DISTINCT user_id) AS users
  FROM cart_additions
  WHERE added_date >= '2025-03-01'

  UNION ALL

  SELECT
    'Checkout Started' AS step,
    3 AS step_order,
    COUNT(DISTINCT user_id) AS users
  FROM checkout_starts
  WHERE checkout_date >= '2025-03-01'

  UNION ALL

  SELECT
    'Purchase Completed' AS step,
    4 AS step_order,
    COUNT(DISTINCT user_id) AS users
  FROM orders
  WHERE order_date >= '2025-03-01'
)
SELECT
  step,
  users,
  ROUND(100.0 * users / FIRST_VALUE(users) OVER (ORDER BY step_order), 2) AS overall_conversion,
  ROUND(100.0 * users / LAG(users) OVER (ORDER BY step_order), 2) AS step_conversion
FROM funnel_steps
ORDER BY step_order;

This analysis reveals that 31.5% of viewers add items to cart, but only 15.12% complete a purchase. The biggest drop-off is from viewing to adding to cart (68.5% drop), while checkout has a strong 80% completion rate.

RFM Analysis

A customer segmentation technique based on three dimensions: Recency (how recently did they purchase), Frequency (how often do they purchase), and Monetary value (how much do they spend).

RFM analysis helps identify your best customers and target marketing efforts effectively.

Example: Calculating RFM Scores

sql
WITH customer_rfm AS (
  SELECT
    customer_id,
    CURRENT_DATE - MAX(order_date) AS recency_days,
    COUNT(*) AS frequency,
    SUM(total_amount) AS monetary
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '365 days'
  GROUP BY customer_id
),
rfm_scores AS (
  SELECT
    customer_id,
    recency_days,
    frequency,
    monetary,
    NTILE(5) OVER (ORDER BY recency_days DESC) AS r_score,
    NTILE(5) OVER (ORDER BY frequency) AS f_score,
    NTILE(5) OVER (ORDER BY monetary) AS m_score
  FROM customer_rfm
)
SELECT
  customer_id,
  recency_days,
  frequency,
  ROUND(monetary, 2) AS monetary_value,
  r_score || f_score || m_score AS rfm_score,
  CASE
    WHEN r_score >= 4 AND f_score >= 4 AND m_score >= 4 THEN 'Champions'
    WHEN r_score >= 3 AND f_score >= 3 THEN 'Loyal Customers'
    WHEN r_score >= 4 AND f_score <= 2 THEN 'New Customers'
    WHEN r_score <= 2 AND f_score >= 3 THEN 'At Risk'
    WHEN r_score <= 2 AND f_score <= 2 THEN 'Lost'
    ELSE 'Potential'
  END AS customer_segment
FROM rfm_scores
ORDER BY r_score DESC, f_score DESC, m_score DESC
LIMIT 10;
NTILE(5) divides customers into 5 equal groups. Higher scores (5) indicate better performance for frequency and monetary, but for recency, higher scores mean less recent purchases, so we ORDER BY recency_days DESC to assign 5 to most recent customers.

Attribution Analysis

Attribution analysis determines which marketing channels or touchpoints deserve credit for conversions.

Example: Last-Touch Attribution

Identify the last marketing channel a customer interacted with before purchasing:

sql
WITH last_touch AS (
  SELECT DISTINCT ON (o.order_id)
    o.order_id,
    o.customer_id,
    o.total_amount,
    m.channel,
    m.interaction_date
  FROM orders o
  LEFT JOIN marketing_interactions m
    ON o.customer_id = m.customer_id
    AND m.interaction_date <= o.order_date
  WHERE o.order_date >= '2025-03-01'
  ORDER BY o.order_id, m.interaction_date DESC
)
SELECT
  channel,
  COUNT(*) AS attributed_orders,
  SUM(total_amount) AS attributed_revenue,
  ROUND(AVG(total_amount), 2) AS avg_order_value,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_orders
FROM last_touch
GROUP BY channel
ORDER BY attributed_revenue DESC;

This shows Google Ads drives the most revenue (32.56% of orders), making it the highest-performing channel in a last-touch attribution model.

Performance Optimization for Analytics

Analytical queries often process large volumes of data. Here are key optimization strategies:

1. Use Appropriate Indexes

Create indexes on columns frequently used in WHERE, JOIN, and GROUP BY clauses:

sql
-- Index on date columns for time-series analysis
CREATE INDEX idx_orders_date ON orders(order_date);

-- Index on foreign keys for joins
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Composite index for common query patterns
CREATE INDEX idx_orders_date_customer ON orders(order_date, customer_id);

2. Use CTEs for Readability and Reusability

Common Table Expressions make complex queries more maintainable:

sql
-- Good: Clear, readable structure
WITH daily_sales AS (
  SELECT DATE(order_date) AS sale_date, SUM(total_amount) AS revenue
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY DATE(order_date)
)
SELECT
  sale_date,
  revenue,
  AVG(revenue) OVER (ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7d
FROM daily_sales;

3. Aggregate Before Joining

Reduce data volume early in your query:

sql
-- Good: Aggregate first, then join
WITH customer_totals AS (
  SELECT customer_id, COUNT(*) AS order_count, SUM(total_amount) AS total_spent
  FROM orders
  GROUP BY customer_id
)
SELECT c.name, ct.order_count, ct.total_spent
FROM customers c
JOIN customer_totals ct ON c.customer_id = ct.customer_id;

-- Avoid: Joining then aggregating creates a large intermediate result
-- SELECT c.name, COUNT(*), SUM(o.total_amount)
-- FROM customers c
-- JOIN orders o ON c.customer_id = o.customer_id
-- GROUP BY c.name;
Be cautious with window functions over large datasets. They can be memory-intensive. Consider using materialized views or pre-aggregated tables for frequently-run analytical queries.

Common Analytical Mistakes

1. Not Handling NULLs Properly

Problem: Aggregations can produce unexpected results with NULL values.

sql
-- Wrong: Division by zero or NULL
SELECT
  product_id,
  SUM(quantity_sold) / COUNT(*) AS avg_per_order
FROM order_items
GROUP BY product_id;

-- Correct: Use COUNT(column) to exclude NULLs, or COALESCE
SELECT
  product_id,
  COALESCE(SUM(quantity_sold) / NULLIF(COUNT(quantity_sold), 0), 0) AS avg_per_order
FROM order_items
GROUP BY product_id;

2. Incorrect Date Comparisons

Problem: Date ranges that miss edge cases.

sql
-- Wrong: Misses orders on 2025-03-31
SELECT * FROM orders
WHERE order_date >= '2025-03-01' AND order_date < '2025-03-31';

-- Correct: Include the entire last day
SELECT * FROM orders
WHERE order_date >= '2025-03-01' AND order_date < '2025-04-01';

3. Not Considering Time Zones

Problem: Data from different time zones can skew daily metrics.

Always normalize timestamps to a consistent time zone (typically UTC) before performing date-based aggregations. Use AT TIME ZONE in PostgreSQL or CONVERT_TZ in MySQL.

Key Takeaways

  • SQL is essential for analytics: It provides powerful tools for calculating KPIs, performing cohort analysis, and tracking trends over time
  • Cohort analysis reveals customer retention and lifetime value patterns by grouping users based on shared characteristics
  • Funnel analysis identifies conversion bottlenecks by tracking user progression through key steps
  • RFM analysis segments customers based on Recency, Frequency, and Monetary value for targeted marketing
  • Window functions are crucial for time-series analysis, enabling year-over-year comparisons and moving averages
  • Optimize analytical queries by using indexes, CTEs, and aggregating before joining
  • Handle NULLs, dates, and time zones carefully to avoid incorrect analytical results
  • Attribution analysis helps determine which marketing channels drive the most value

Mastering SQL for analytics enables you to extract actionable insights from data, drive data-informed decisions, and create valuable reports and dashboards for your organization.

Practice Exercises

  1. Customer Lifetime Value: Write a query to calculate the total lifetime value and average order frequency for each customer. Include only customers with at least 3 orders.

  2. Revenue Trending: Calculate month-over-month revenue growth percentage. Show months where revenue decreased compared to the previous month.

  3. Product Performance: Create a funnel showing product views → add to cart → purchase completion rates for each product category. Identify which category has the best conversion.

  4. Cohort Retention Matrix: Build a cohort retention matrix showing the percentage of customers from each monthly cohort who made purchases in months 0, 1, 2, and 3 after their first purchase.

  5. Marketing ROI: Calculate the return on investment for each marketing channel by comparing attributed revenue to marketing spend. Include channels with at least $5,000 in spend.

  6. Churn Prediction: Identify customers who haven't made a purchase in 90+ days but were previously active (3+ orders in the prior 180 days). Calculate what percentage of revenue they represented.