Financial Time Series Foundations
Master stationarity, autocorrelation, volatility clustering, and why normal distribution assumptions fail for financial data
Introduction
Financial price data is a time series: observations indexed by time. Understanding the statistical properties of financial time series is crucial for building indicators and strategies that actually work.
Unlike simple datasets (heights of people, test scores), financial time series have unique properties:
- Non-stationarity: Mean and variance change over time
- Volatility clustering: Big moves tend to follow big moves
- Fat tails: Extreme events happen more often than normal distributions predict
- Serial correlation: Today's returns can predict tomorrow's (weak but exploitable)
This lesson covers:
- The difference between price and returns
- Stationarity and why it matters
- Autocorrelation and serial dependence
- Volatility clustering
- Why normal distribution assumptions fail
Price vs. Returns: A Critical Distinction
Technical analysis can work in price space or return space. Understanding the difference is fundamental.
Price Series
Price is the absolute level: $150, $200. Price series have problems:
- Non-stationary: Mean and variance change as price trends
- Scale-dependent: A $50 stock and a $500 stock can make the same percentage move with very different dollar changes
- Non-comparable: Price levels (e.g., AAPL at $150) can't be compared directly across stocks
```python
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt

# Download Apple stock data
aapl = yf.download('AAPL', start='2020-01-01', end='2024-01-01', progress=False)

# Newer yfinance versions return MultiIndex columns even for a single
# ticker; flatten so aapl['Close'] is a plain Series
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.get_level_values(0)

# Plot price
plt.figure(figsize=(12, 4))
plt.plot(aapl['Close'])
plt.title('AAPL Price (Non-Stationary)')
plt.ylabel('Price ($)')
plt.show()

# Calculate basic statistics
print(f"Mean price: ${aapl['Close'].mean():.2f}")
print(f"Std dev: ${aapl['Close'].std():.2f}")
print(f"Min: ${aapl['Close'].min():.2f}, Max: ${aapl['Close'].max():.2f}")
```

The "mean price" (~$156) is misleading: it averages the ~$100 levels of 2020 with much higher later prices, so it describes no actual period of the series.
Return Series
Returns measure percentage change. There are two types:
1. Simple (Arithmetic) Returns

$$R_t = \frac{P_t - P_{t-1}}{P_{t-1}}$$

Where $P_t$ is the price at time $t$.
2. Log Returns

$$r_t = \ln\left(\frac{P_t}{P_{t-1}}\right)$$

Log returns are preferred in finance because:
- Time additivity: $r_{1:T} = r_1 + r_2 + \dots + r_T$ (you can sum them across periods)
- Symmetry: A 50% gain followed by a 50% loss gives $\ln(1.5) + \ln(0.5) \approx -0.29$ total log return (showing the actual loss)
- Normality: Log returns are more normally distributed than simple returns
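These properties are easy to verify numerically; a quick sanity check of the symmetry and additivity claims above (plain NumPy, no market data needed):

```python
import numpy as np

# A 50% gain followed by a 50% loss
gain, loss = 0.50, -0.50

# Simple returns: the naive sum is 0, but the compounded result is -25%
compounded = (1 + gain) * (1 + loss) - 1          # -0.25

# Log returns: they add across periods, and the sum reflects the real outcome
log_total = np.log(1 + gain) + np.log(1 + loss)   # ~ -0.2877

# Converting the summed log return back to a simple return recovers -25%
round_trip = np.exp(log_total) - 1

print(f"Compounded simple return: {compounded:.2%}")
print(f"Total log return: {log_total:.4f}")
print(f"Round trip: {round_trip:.2%}")
```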
```python
import numpy as np

# Calculate both types of returns
aapl['Simple_Return'] = aapl['Close'].pct_change()
aapl['Log_Return'] = np.log(aapl['Close'] / aapl['Close'].shift(1))

# Compare the two
print("Simple vs Log Returns (first 10 non-null):")
print(aapl[['Close', 'Simple_Return', 'Log_Return']].dropna().head(10))

# Statistics on returns
print("\nReturn statistics:")
print(f"Mean simple return: {aapl['Simple_Return'].mean():.6f} ({aapl['Simple_Return'].mean()*252:.2%} annualized)")
print(f"Mean log return: {aapl['Log_Return'].mean():.6f} ({aapl['Log_Return'].mean()*252:.2%} annualized)")
print(f"Std dev (simple): {aapl['Simple_Return'].std():.6f} ({aapl['Simple_Return'].std()*np.sqrt(252):.2%} annualized)")
```

Notice that returns are much more stationary than prices: they fluctuate around a small mean (~0.12% daily) rather than trending.
Pro tip: Use log returns for analytical work (calculating statistics, building models) and simple returns when presenting results to end users (more intuitive: "you made 15%").
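In practice the two conventions are one line apart; a minimal sketch of this workflow, using simulated daily log returns (the numbers are illustrative, not AAPL's):

```python
import numpy as np

# Simulated daily log returns standing in for np.log(close).diff()
rng = np.random.default_rng(42)
log_returns = rng.normal(loc=0.0005, scale=0.02, size=252)

# Analytical work in log space: multi-period returns are just sums
ann_log = log_returns.sum()
ann_vol = log_returns.std(ddof=1) * np.sqrt(252)

# Presentation in simple-return space: convert once, at the end
ann_simple = np.exp(ann_log) - 1

print(f"Annual log return:    {ann_log:.4f}")
print(f"Annual simple return: {ann_simple:.2%}")
print(f"Annualized vol:       {ann_vol:.2%}")
```

The conversion `np.exp(ann_log) - 1` is exact, so nothing is lost by doing the analysis in log space and translating only for the final report.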
Stationarity: The Foundation of Time Series Analysis
A time series is stationary if its statistical properties (mean, variance, autocorrelation) don't change over time.
Why Stationarity Matters
Most statistical techniques assume stationarity. If you fit an indicator to non-stationary data:
- Parameters optimized on the past won't work in the future
- Backtests are unreliable (you're training on a different distribution than you'll trade)
- Risk estimates are wrong
Key insight: Price is non-stationary (trends), but returns are approximately stationary (fluctuate around constant mean).
A time series is strictly stationary if the joint distribution of $(X_{t_1}, \dots, X_{t_k})$ is the same as that of $(X_{t_1+h}, \dots, X_{t_k+h})$ for any time shift $h$.
Weak stationarity (more practical) requires:
- Constant mean: $E[X_t] = \mu$ for all $t$
- Constant variance: $\mathrm{Var}(X_t) = \sigma^2$ for all $t$
- Autocovariance depends only on lag: $\mathrm{Cov}(X_t, X_{t+k})$ is a function of $k$ only, not of $t$
Financial returns are approximately weakly stationary over short to medium horizons.
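The price/returns contrast can be reproduced without any market data; a sketch using a simulated random walk (standing in for price) and its increments (standing in for returns):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Returns": i.i.d. noise, weakly stationary by construction
returns = rng.normal(0, 0.01, size=2000)

# "Price": cumulative sum of returns, i.e. a random walk (non-stationary)
price = 100 + np.cumsum(returns)

def half_means(x):
    """Mean of each half of the series -- a crude stationarity check."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

r1, r2 = half_means(returns)
p1, p2 = half_means(price)

print(f"Return half-means: {r1:.5f} vs {r2:.5f}")
print(f"Price half-means:  {p1:.2f} vs {p2:.2f}")
```

The return half-means agree to a few basis points, while the random walk's half-means drift apart — the same pattern the AAPL test below shows with real data.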
```python
# Test stationarity: compare statistics in first half vs. second half
midpoint = len(aapl) // 2
first_half = aapl['Log_Return'].iloc[:midpoint]
second_half = aapl['Log_Return'].iloc[midpoint:]

print("Stationarity Test: First Half vs. Second Half")
print(f"First half mean: {first_half.mean():.6f}, std: {first_half.std():.6f}")
print(f"Second half mean: {second_half.mean():.6f}, std: {second_half.std():.6f}")
print(f"\nMean difference: {abs(first_half.mean() - second_half.mean()):.6f}")
print(f"Std difference: {abs(first_half.std() - second_half.std()):.6f}")

# For comparison, test non-stationary price
first_half_price = aapl['Close'].iloc[:midpoint]
second_half_price = aapl['Close'].iloc[midpoint:]
print(f"\nPrice (non-stationary):")
print(f"First half mean: ${first_half_price.mean():.2f}, std: ${first_half_price.std():.2f}")
print(f"Second half mean: ${second_half_price.mean():.2f}, std: ${second_half_price.std():.2f}")
```

Returns show similar mean and standard deviation across the two halves (stationary), while the price mean shifts dramatically between halves (non-stationary).
Autocorrelation: Does the Past Predict the Future?
Autocorrelation measures how correlated a time series is with its own lagged values. It's the foundation of momentum and mean-reversion strategies.
Autocorrelation Formula

$$\rho_k = \frac{\sum_{t=k+1}^{T} (r_t - \bar{r})(r_{t-k} - \bar{r})}{\sum_{t=1}^{T} (r_t - \bar{r})^2}$$

Where:
- $\rho_k$ is the autocorrelation at lag $k$
- $r_t$ is the return at time $t$, and $\bar{r}$ is the mean return
- $k$ is the lag (1 day, 5 days, 20 days, etc.)
Interpretation:
- $\rho_k > 0$: Positive autocorrelation (momentum) - positive returns tend to follow positive returns
- $\rho_k < 0$: Negative autocorrelation (mean reversion) - positive returns tend to be followed by negative returns
- $\rho_k \approx 0$: No predictive relationship (random walk)
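The formula above is a few lines of NumPy; a minimal implementation, sanity-checked on a simulated AR(1) process whose theoretical lag-k autocorrelation is phi**k (pandas' `Series.autocorr`, used below, applies a slightly different Pearson normalization but agrees closely for long series):

```python
import numpy as np

def acf(x, k):
    """Sample autocorrelation at lag k (textbook ACF estimator)."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    num = np.sum((x[k:] - xbar) * (x[:-k] - xbar))
    den = np.sum((x - xbar) ** 2)
    return num / den

# Simulated AR(1): x_t = phi * x_{t-1} + noise, so ACF at lag k is phi**k
rng = np.random.default_rng(1)
phi, n = 0.5, 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

print(f"Lag-1 ACF: {acf(x, 1):.3f} (theory: {phi})")
print(f"Lag-2 ACF: {acf(x, 2):.3f} (theory: {phi**2})")
```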
```python
# Calculate autocorrelation for lags 1 to 20
lags = range(1, 21)
autocorr = [aapl['Log_Return'].autocorr(lag=lag) for lag in lags]

# Create a table of results
autocorr_df = pd.DataFrame({
    'Lag': lags,
    'Autocorrelation': autocorr
})
print("Autocorrelation of AAPL daily log returns:")
print(autocorr_df.head(10))

# Plot autocorrelation
plt.figure(figsize=(10, 4))
plt.bar(lags, autocorr)
plt.axhline(y=0, color='black', linestyle='-', linewidth=0.8)
plt.axhline(y=0.05, color='red', linestyle='--', linewidth=0.8, alpha=0.5, label='±0.05 threshold')
plt.axhline(y=-0.05, color='red', linestyle='--', linewidth=0.8, alpha=0.5)
plt.title('Autocorrelation Function (ACF)')
plt.xlabel('Lag (days)')
plt.ylabel('Autocorrelation')
plt.legend()
plt.show()
```

Interpreting the Results
For AAPL:
- Lag 1 autocorrelation: -0.015 (very weak mean reversion at daily scale)
- Most lags near zero: Daily returns are nearly unpredictable from past daily returns
- Random walk hypothesis: Short-term price changes are largely random
However, at different time scales, patterns emerge:
- Momentum: 3-12 month returns show positive autocorrelation (winners keep winning)
- Mean reversion: Very short-term (intraday) and very long-term (multi-year) show mean reversion
Key insight: Daily stock returns have very low autocorrelation (nearly random), but this doesn't mean markets are completely unpredictable. Autocorrelation varies by time scale, asset class, and market regime.
Volatility Clustering: Big Moves Follow Big Moves
Volatility clustering is the phenomenon where large price changes tend to be followed by large price changes (of either sign), and small changes tend to be followed by small changes.
The Pattern
Look at absolute returns $|r_t|$ or squared returns $r_t^2$:
```python
# Calculate squared returns (proxy for volatility)
aapl['Squared_Return'] = aapl['Log_Return'] ** 2

# Calculate autocorrelation of squared returns
lags = range(1, 21)
vol_autocorr = [aapl['Squared_Return'].autocorr(lag=lag) for lag in lags]

print("Autocorrelation of squared returns (volatility):")
for lag, corr in zip(lags[:10], vol_autocorr[:10]):
    print(f"Lag {lag:2d}: {corr:.4f}")

# Compare: returns have low autocorr, but volatility has high autocorr
print(f"\nReturn autocorr (lag 1): {aapl['Log_Return'].autocorr(1):.4f}")
print(f"Volatility autocorr (lag 1): {vol_autocorr[0]:.4f}")
```

Interpretation:
- Returns autocorrelation: -0.0152 (nearly zero, essentially random)
- Volatility autocorrelation: 0.2845 (strong positive correlation)
Conclusion: You can't predict the direction of tomorrow's return, but you can predict that if today was volatile, tomorrow will likely be volatile too.
Why Volatility Clustering Matters
- Risk management: After big moves, increase position size limits or use wider stops
- Strategy timing: Trend-following works better in high-volatility regimes
- Indicator parameters: Consider adaptive parameters based on recent volatility
Practical application: Use rolling volatility (e.g., 20-day standard deviation) as a signal. When volatility spikes, reduce position sizes or tighten stops. When volatility is low, you can afford larger positions.
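That rule of thumb takes a few lines of pandas; a sketch on simulated returns with a deliberately volatile middle regime (the 20-day window and the 2× median threshold are illustrative choices, not recommendations):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Simulated daily returns: calm regime, stressed regime, calm regime
returns = pd.Series(np.concatenate([
    rng.normal(0, 0.01, 200),   # calm
    rng.normal(0, 0.03, 100),   # stressed: 3x the volatility
    rng.normal(0, 0.01, 200),   # calm again
]))

# 20-day rolling volatility, annualized
rolling_vol = returns.rolling(20).std() * np.sqrt(252)

# Illustrative sizing rule: halve position size when vol exceeds 2x its median
vol_median = rolling_vol.median()
position_scale = np.where(rolling_vol > 2 * vol_median, 0.5, 1.0)

print(f"Median annualized vol: {vol_median:.1%}")
print(f"Days at reduced size:  {(position_scale == 0.5).sum()}")
```

The stressed regime pushes rolling volatility well above twice its median, so the rule cuts size exactly where the big moves cluster.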
Fat Tails and the Failure of Normal Distribution
Many models assume returns are normally distributed (bell curve). This is dangerously wrong for financial data.
Normal Distribution vs. Reality
Normal distribution properties:
- 68% of observations within 1 standard deviation
- 95% within 2 standard deviations
- 99.7% within 3 standard deviations
- Events beyond 4-5 standard deviations are virtually impossible
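These benchmark frequencies come straight from the normal CDF; a quick check with SciPy:

```python
from scipy import stats

# Two-sided probability of landing within k standard deviations of the mean
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3, 4, 5)}

for k, p in within.items():
    p_beyond = 1 - p
    # Roughly one beyond-k-sigma day every 1/p_beyond trading days
    print(f"{k} sigma: {p:.5%} within, ~1 day in {1 / p_beyond:,.0f} beyond")
```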
Reality: Financial returns have fat tails - extreme events happen far more often than the normal distribution predicts.
```python
# Calculate how often extreme events occur
returns = aapl['Log_Return'].dropna()
mean_return = returns.mean()
std_return = returns.std()

# Count events beyond 2, 3, 4 standard deviations
beyond_2std = (abs(returns - mean_return) > 2 * std_return).sum()
beyond_3std = (abs(returns - mean_return) > 3 * std_return).sum()
beyond_4std = (abs(returns - mean_return) > 4 * std_return).sum()
total_days = len(returns)

print("Extreme Event Frequency (AAPL daily returns):")
print(f"Total trading days: {total_days}")
print(f"\nActual vs. Normal Distribution:")
print(f"Beyond 2 std: {beyond_2std} days ({beyond_2std/total_days:.2%}) vs. {0.05:.2%} expected")
print(f"Beyond 3 std: {beyond_3std} days ({beyond_3std/total_days:.2%}) vs. {0.003:.2%} expected")
print(f"Beyond 4 std: {beyond_4std} days ({beyond_4std/total_days:.3%}) vs. {0.00006:.3%} expected")

# Find the worst day
worst_day_idx = returns.abs().idxmax()
worst_day_return = returns.loc[worst_day_idx]
worst_day_std = abs(worst_day_return - mean_return) / std_return

print(f"\nWorst single day: {worst_day_idx.date()}")
print(f"Return: {worst_day_return:.4f} ({worst_day_return*100:.2f}%)")
print(f"Standard deviations from mean: {worst_day_std:.2f}")
```

Analysis:
- Beyond 3 std: Expected 0.3% (3 days), observed 1.79% (18 days) - 6x more frequent
- Beyond 4 std: Expected 0.006% (0.06 days), observed 0.79% (8 days) - 132x more frequent
- Worst day: -12.89% (6.54 standard deviations) - this is a 1-in-several-billion event under normal distribution, yet it happened
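Fat tails also show up in the fourth moment; a sketch comparing the excess kurtosis of simulated normal and Student-t returns (the degrees of freedom are chosen for illustration; real daily returns often look t-like with low single-digit degrees of freedom):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200_000

# Thin-tailed benchmark: the normal distribution has excess kurtosis 0
normal_sample = rng.normal(size=n)

# Fat-tailed alternative: Student-t with df=10 has excess kurtosis 6/(df-4) = 1
t_sample = rng.standard_t(df=10, size=n)

k_normal = stats.kurtosis(normal_sample)  # Fisher definition: excess kurtosis
k_t = stats.kurtosis(t_sample)

print(f"Excess kurtosis, normal:        {k_normal:.2f}")
print(f"Excess kurtosis, Student-t(10): {k_t:.2f}")
```

Positive excess kurtosis is exactly the "more mass in the tails than the normal" property measured by the frequency counts above.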
Implications for Trading
- Risk models underestimate tail risk: VaR models assuming normality will blow up
- Stop losses get run more often: 3-std stops should trigger 0.3% of the time but actually trigger 1-2%
- Black swan events are not rare: Plan for 5-10 standard deviation moves (2008, 2020, etc.)
Critical: Never assume returns are normally distributed. Always stress-test strategies for extreme moves (10-20% single-day drops). The Black Monday crash (1987) was a 20-std event under normal distribution - impossible by that model, yet it happened.
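A basic stress test is just a few lines: inject a synthetic crash day into a return stream and re-measure the damage (the -15% shock and the simulated baseline are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)

# Baseline: one simulated year of daily returns
returns = rng.normal(0.0005, 0.01, 252)

def max_drawdown(rets):
    """Largest peak-to-trough equity decline, as a (negative) fraction."""
    equity = np.cumprod(1 + rets)
    peaks = np.maximum.accumulate(equity)
    return ((equity - peaks) / peaks).min()

# Stress scenario: replace one ordinary mid-year day with a -15% crash
stressed = returns.copy()
stressed[126] = -0.15

print(f"Baseline max drawdown: {max_drawdown(returns):.1%}")
print(f"Stressed max drawdown: {max_drawdown(stressed):.1%}")
```

Running the same strategy logic over the stressed series shows whether stops, leverage, and position sizes survive a move the backtest window never contained.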
Summary
Key Takeaways
- Use returns, not prices: Log returns are stationary, additive, and more suitable for analysis
- Stationarity matters: Most statistical techniques require stationarity; returns are approximately stationary
- Autocorrelation is weak but exploitable: Daily returns show little autocorrelation, but longer horizons exhibit momentum
- Volatility clusters: Large moves follow large moves - use this for risk management
- Fat tails are real: Extreme events occur far more often than normal distribution predicts - always plan for tail risk
Next Steps
Now that you understand the statistical foundations of financial time series, the next lesson covers OHLCV data structure: the standard format for financial data and how to work with open, high, low, close, and volume information.