Financial Time Series Foundations
Master stationarity, autocorrelation, volatility clustering, and why normal distribution assumptions fail for financial data
Introduction
Financial price data is a time series: observations indexed by time. Understanding the statistical properties of financial time series is crucial for building indicators and strategies that actually work.
Unlike simple datasets (heights of people, test scores), financial time series have unique properties:
- Non-stationarity: Mean and variance change over time
- Volatility clustering: Big moves tend to follow big moves
- Fat tails: Extreme events happen more often than normal distributions predict
- Serial correlation: Today's returns can predict tomorrow's (weak but exploitable)
This lesson covers:
- The difference between price and returns
- Stationarity and why it matters
- Autocorrelation and serial dependence
- Volatility clustering
- Why normal distribution assumptions fail
Price vs. Returns: A Critical Distinction
Technical analysis can work in price space or return space. Understanding the difference is fundamental.
Price Series
Price is the absolute level: $150, $200. Price series have problems:
- Non-stationary: Mean and variance change as price trends
- Scale-dependent: A $50 stock and a $500 stock can make the same percentage move with very different dollar changes
- Non-comparable: Price levels (e.g., AAPL at $150) can't be compared directly across stocks
```python
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt

# Download Apple stock data
aapl = yf.download('AAPL', start='2020-01-01', end='2024-01-01', progress=False)

# Newer yfinance versions return MultiIndex columns even for a single
# ticker; flatten so aapl['Close'] is a plain Series
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.get_level_values(0)

# Plot price
plt.figure(figsize=(12, 4))
plt.plot(aapl['Close'])
plt.title('AAPL Price (Non-Stationary)')
plt.ylabel('Price ($)')
plt.show()

# Calculate basic statistics
print(f"Mean price: ${aapl['Close'].mean():.2f}")
print(f"Std dev: ${aapl['Close'].std():.2f}")
print(f"Min: ${aapl['Close'].min():.2f}, Max: ${aapl['Close'].max():.2f}")
```

The "mean price" (~$156) is misleading: it averages the ~$100 levels of 2020 with much higher later prices, so it describes no actual period of the series.
Return Series
Returns measure percentage change. There are two types:
1. Simple (Arithmetic) Returns

$$R_t = \frac{P_t - P_{t-1}}{P_{t-1}}$$

Where $P_t$ is the price at time $t$.
2. Log Returns

$$r_t = \ln\left(\frac{P_t}{P_{t-1}}\right)$$

Log returns are preferred in finance because:
- Time additivity: $r_{1:T} = r_1 + r_2 + \dots + r_T$ (you can sum them across periods)
- Symmetry: A 50% gain followed by a 50% loss gives $\ln(1.5) + \ln(0.5) \approx -0.29$ total log return (showing the actual loss)
- Normality: Log returns are more normally distributed than simple returns
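These properties are easy to verify numerically; a quick sanity check of the symmetry and additivity claims above (plain NumPy, no market data needed):

```python
import numpy as np

# A 50% gain followed by a 50% loss
gain, loss = 0.50, -0.50

# Simple returns: the naive sum is 0, but the compounded result is -25%
compounded = (1 + gain) * (1 + loss) - 1          # -0.25

# Log returns: they add across periods, and the sum reflects the real outcome
log_total = np.log(1 + gain) + np.log(1 + loss)   # ~ -0.2877

# Converting the summed log return back to a simple return recovers -25%
round_trip = np.exp(log_total) - 1

print(f"Compounded simple return: {compounded:.2%}")
print(f"Total log return: {log_total:.4f}")
print(f"Round trip: {round_trip:.2%}")
```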
```python
import numpy as np

# Calculate both types of returns
aapl['Simple_Return'] = aapl['Close'].pct_change()
aapl['Log_Return'] = np.log(aapl['Close'] / aapl['Close'].shift(1))

# Compare the two
print("Simple vs Log Returns (first 10 non-null):")
print(aapl[['Close', 'Simple_Return', 'Log_Return']].dropna().head(10))

# Statistics on returns
print("\nReturn statistics:")
print(f"Mean simple return: {aapl['Simple_Return'].mean():.6f} ({aapl['Simple_Return'].mean()*252:.2%} annualized)")
print(f"Mean log return: {aapl['Log_Return'].mean():.6f} ({aapl['Log_Return'].mean()*252:.2%} annualized)")
print(f"Std dev (simple): {aapl['Simple_Return'].std():.6f} ({aapl['Simple_Return'].std()*np.sqrt(252):.2%} annualized)")
```

Notice that returns are much more stationary than prices: they fluctuate around a small mean (~0.12% daily) rather than trending.
Pro tip: Use log returns for analytical work (calculating statistics, building models) and simple returns when presenting results to end users (more intuitive: "you made 15%").
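In practice the two conventions are one line apart; a minimal sketch of this workflow, using simulated daily log returns (the numbers are illustrative, not AAPL's):

```python
import numpy as np

# Simulated daily log returns standing in for np.log(close).diff()
rng = np.random.default_rng(42)
log_returns = rng.normal(loc=0.0005, scale=0.02, size=252)

# Analytical work in log space: multi-period returns are just sums
ann_log = log_returns.sum()
ann_vol = log_returns.std(ddof=1) * np.sqrt(252)

# Presentation in simple-return space: convert once, at the end
ann_simple = np.exp(ann_log) - 1

print(f"Annual log return:    {ann_log:.4f}")
print(f"Annual simple return: {ann_simple:.2%}")
print(f"Annualized vol:       {ann_vol:.2%}")
```

The conversion `np.exp(ann_log) - 1` is exact, so nothing is lost by doing the analysis in log space and translating only for the final report.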
Stationarity: The Foundation of Time Series Analysis
A time series is stationary if its statistical properties (mean, variance, autocorrelation) don't change over time.
Why Stationarity Matters
Most statistical techniques assume stationarity. If you fit an indicator to non-stationary data:
- Parameters optimized on the past won't work in the future
- Backtests are unreliable (you're training on a different distribution than you'll trade)
- Risk estimates are wrong
Key insight: Price is non-stationary (trends), but returns are approximately stationary (fluctuate around constant mean).
A time series is strictly stationary if the joint distribution of $(X_{t_1}, \dots, X_{t_k})$ is the same as that of $(X_{t_1+h}, \dots, X_{t_k+h})$ for any time shift $h$.
Weak stationarity (more practical) requires:
- Constant mean: $E[X_t] = \mu$ for all $t$
- Constant variance: $\mathrm{Var}(X_t) = \sigma^2$ for all $t$
- Autocovariance depends only on lag: $\mathrm{Cov}(X_t, X_{t+k})$ is a function of $k$ only, not of $t$
Financial returns are approximately weakly stationary over short to medium horizons.
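The price/returns contrast can be reproduced without any market data; a sketch using a simulated random walk (standing in for price) and its increments (standing in for returns):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Returns": i.i.d. noise, weakly stationary by construction
returns = rng.normal(0, 0.01, size=2000)

# "Price": cumulative sum of returns, i.e. a random walk (non-stationary)
price = 100 + np.cumsum(returns)

def half_means(x):
    """Mean of each half of the series -- a crude stationarity check."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

r1, r2 = half_means(returns)
p1, p2 = half_means(price)

print(f"Return half-means: {r1:.5f} vs {r2:.5f}")
print(f"Price half-means:  {p1:.2f} vs {p2:.2f}")
```

The return half-means agree to a few basis points, while the random walk's half-means drift apart — the same pattern the AAPL test below shows with real data.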
```python
# Test stationarity: compare statistics in first half vs. second half
midpoint = len(aapl) // 2
first_half = aapl['Log_Return'].iloc[:midpoint]
second_half = aapl['Log_Return'].iloc[midpoint:]

print("Stationarity Test: First Half vs. Second Half")
print(f"First half mean: {first_half.mean():.6f}, std: {first_half.std():.6f}")
print(f"Second half mean: {second_half.mean():.6f}, std: {second_half.std():.6f}")
print(f"\nMean difference: {abs(first_half.mean() - second_half.mean()):.6f}")
print(f"Std difference: {abs(first_half.std() - second_half.std()):.6f}")

# For comparison, test non-stationary price
first_half_price = aapl['Close'].iloc[:midpoint]
second_half_price = aapl['Close'].iloc[midpoint:]
print(f"\nPrice (non-stationary):")
print(f"First half mean: ${first_half_price.mean():.2f}, std: ${first_half_price.std():.2f}")
print(f"Second half mean: ${second_half_price.mean():.2f}, std: ${second_half_price.std():.2f}")
```

Returns show similar mean and standard deviation across the two halves (stationary), while the price mean shifts dramatically between halves (non-stationary).
Autocorrelation: Does the Past Predict the Future?
Autocorrelation measures how correlated a time series is with its own lagged values. It's the foundation of momentum and mean-reversion strategies.
Autocorrelation Formula

$$\rho_k = \frac{\sum_{t=k+1}^{T} (r_t - \bar{r})(r_{t-k} - \bar{r})}{\sum_{t=1}^{T} (r_t - \bar{r})^2}$$

Where:
- $\rho_k$ is the autocorrelation at lag $k$
- $r_t$ is the return at time $t$, and $\bar{r}$ is the mean return
- $k$ is the lag (1 day, 5 days, 20 days, etc.)
Interpretation:
- $\rho_k > 0$: Positive autocorrelation (momentum) - positive returns tend to follow positive returns
- $\rho_k < 0$: Negative autocorrelation (mean reversion) - positive returns tend to be followed by negative returns
- $\rho_k \approx 0$: No predictive relationship (random walk)
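The formula above is a few lines of NumPy; a minimal implementation, sanity-checked on a simulated AR(1) process whose theoretical lag-k autocorrelation is phi**k (pandas' `Series.autocorr`, used below, applies a slightly different Pearson normalization but agrees closely for long series):

```python
import numpy as np

def acf(x, k):
    """Sample autocorrelation at lag k (textbook ACF estimator)."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    num = np.sum((x[k:] - xbar) * (x[:-k] - xbar))
    den = np.sum((x - xbar) ** 2)
    return num / den

# Simulated AR(1): x_t = phi * x_{t-1} + noise, so ACF at lag k is phi**k
rng = np.random.default_rng(1)
phi, n = 0.5, 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

print(f"Lag-1 ACF: {acf(x, 1):.3f} (theory: {phi})")
print(f"Lag-2 ACF: {acf(x, 2):.3f} (theory: {phi**2})")
```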
```python
# Calculate autocorrelation for lags 1 to 20
lags = range(1, 21)
autocorr = [aapl['Log_Return'].autocorr(lag=lag) for lag in lags]

# Create a table of results
autocorr_df = pd.DataFrame({
    'Lag': lags,
    'Autocorrelation': autocorr
})
print("Autocorrelation of AAPL daily log returns:")
print(autocorr_df.head(10))

# Plot autocorrelation
plt.figure(figsize=(10, 4))
plt.bar(lags, autocorr)
plt.axhline(y=0, color='black', linestyle='-', linewidth=0.8)
plt.axhline(y=0.05, color='red', linestyle='--', linewidth=0.8, alpha=0.5, label='±0.05 threshold')
plt.axhline(y=-0.05, color='red', linestyle='--', linewidth=0.8, alpha=0.5)
plt.title('Autocorrelation Function (ACF)')
plt.xlabel('Lag (days)')
plt.ylabel('Autocorrelation')
plt.legend()
plt.show()
```

Interpreting the Results
For AAPL:
- Lag 1 autocorrelation: -0.015 (very weak mean reversion at daily scale)
- Most lags near zero: Daily returns are nearly unpredictable from past daily returns
- Random walk hypothesis: Short-term price changes are largely random
However, at different time scales, patterns emerge:
- Momentum: 3-12 month returns show positive autocorrelation (winners keep winning)
- Mean reversion: Very short-term (intraday) and very long-term (multi-year) show mean reversion
Key insight: Daily stock returns have very low autocorrelation (nearly random), but this doesn't mean markets are completely unpredictable. Autocorrelation varies by time scale, asset class, and market regime.
Volatility Clustering: Big Moves Follow Big Moves
Volatility clustering is the phenomenon where large price changes tend to be followed by large price changes (of either sign), and small changes tend to be followed by small changes.
The Pattern
Look at absolute returns $|r_t|$ or squared returns $r_t^2$:
```python
# Calculate squared returns (proxy for volatility)
aapl['Squared_Return'] = aapl['Log_Return'] ** 2

# Calculate autocorrelation of squared returns
lags = range(1, 21)
vol_autocorr = [aapl['Squared_Return'].autocorr(lag=lag) for lag in lags]

print("Autocorrelation of squared returns (volatility):")
for lag, corr in zip(lags[:10], vol_autocorr[:10]):
    print(f"Lag {lag:2d}: {corr:.4f}")

# Compare: returns have low autocorr, but volatility has high autocorr
print(f"\nReturn autocorr (lag 1): {aapl['Log_Return'].autocorr(1):.4f}")
print(f"Volatility autocorr (lag 1): {vol_autocorr[0]:.4f}")
```

Interpretation:
- Returns autocorrelation: -0.0152 (nearly zero, essentially random)
- Volatility autocorrelation: 0.2845 (strong positive correlation)
Conclusion: You can't predict the direction of tomorrow's return, but you can predict that if today was volatile, tomorrow will likely be volatile too.
Why Volatility Clustering Matters
- Risk management: After big moves, increase position size limits or use wider stops
- Strategy timing: Trend-following works better in high-volatility regimes
- Indicator parameters: Consider adaptive parameters based on recent volatility
Practical application: Use rolling volatility (e.g., 20-day standard deviation) as a signal. When volatility spikes, reduce position sizes or tighten stops. When volatility is low, you can afford larger positions.
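That rule of thumb takes a few lines of pandas; a sketch on simulated returns with a deliberately volatile middle regime (the 20-day window and the 2× median threshold are illustrative choices, not recommendations):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Simulated daily returns: calm regime, stressed regime, calm regime
returns = pd.Series(np.concatenate([
    rng.normal(0, 0.01, 200),   # calm
    rng.normal(0, 0.03, 100),   # stressed: 3x the volatility
    rng.normal(0, 0.01, 200),   # calm again
]))

# 20-day rolling volatility, annualized
rolling_vol = returns.rolling(20).std() * np.sqrt(252)

# Illustrative sizing rule: halve position size when vol exceeds 2x its median
vol_median = rolling_vol.median()
position_scale = np.where(rolling_vol > 2 * vol_median, 0.5, 1.0)

print(f"Median annualized vol: {vol_median:.1%}")
print(f"Days at reduced size:  {(position_scale == 0.5).sum()}")
```

The stressed regime pushes rolling volatility well above twice its median, so the rule cuts size exactly where the big moves cluster.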
Fat Tails and the Failure of Normal Distribution
Many models assume returns are normally distributed (bell curve). This is dangerously wrong for financial data.
Normal Distribution vs. Reality
Normal distribution properties:
- 68% of observations within 1 standard deviation
- 95% within 2 standard deviations
- 99.7% within 3 standard deviations
- Events beyond 4-5 standard deviations are virtually impossible
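These benchmark frequencies come straight from the normal CDF; a quick check with SciPy:

```python
from scipy import stats

# Two-sided probability of landing within k standard deviations of the mean
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3, 4, 5)}

for k, p in within.items():
    p_beyond = 1 - p
    # Roughly one beyond-k-sigma day every 1/p_beyond trading days
    print(f"{k} sigma: {p:.5%} within, ~1 day in {1 / p_beyond:,.0f} beyond")
```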
Reality: Financial returns have fat tails - extreme events happen far more often than the normal distribution predicts.
```python
# Calculate how often extreme events occur
returns = aapl['Log_Return'].dropna()
mean_return = returns.mean()
std_return = returns.std()

# Count events beyond 2, 3, 4 standard deviations
beyond_2std = (abs(returns - mean_return) > 2 * std_return).sum()
beyond_3std = (abs(returns - mean_return) > 3 * std_return).sum()
beyond_4std = (abs(returns - mean_return) > 4 * std_return).sum()
total_days = len(returns)

print("Extreme Event Frequency (AAPL daily returns):")
print(f"Total trading days: {total_days}")
print(f"\nActual vs. Normal Distribution:")
print(f"Beyond 2 std: {beyond_2std} days ({beyond_2std/total_days:.2%}) vs. {0.05:.2%} expected")
print(f"Beyond 3 std: {beyond_3std} days ({beyond_3std/total_days:.2%}) vs. {0.003:.2%} expected")
print(f"Beyond 4 std: {beyond_4std} days ({beyond_4std/total_days:.3%}) vs. {0.00006:.3%} expected")

# Find the worst day
worst_day_idx = returns.abs().idxmax()
worst_day_return = returns.loc[worst_day_idx]
worst_day_std = abs(worst_day_return - mean_return) / std_return

print(f"\nWorst single day: {worst_day_idx.date()}")
print(f"Return: {worst_day_return:.4f} ({worst_day_return*100:.2f}%)")
print(f"Standard deviations from mean: {worst_day_std:.2f}")
```

Analysis:
- Beyond 3 std: Expected 0.3% (3 days), observed 1.79% (18 days) - 6x more frequent
- Beyond 4 std: Expected 0.006% (0.06 days), observed 0.79% (8 days) - 132x more frequent
- Worst day: -12.89% (6.54 standard deviations) - this is a 1-in-several-billion event under normal distribution, yet it happened
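Fat tails also show up in the fourth moment; a sketch comparing the excess kurtosis of simulated normal and Student-t returns (the degrees of freedom are chosen for illustration; real daily returns often look t-like with low single-digit degrees of freedom):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200_000

# Thin-tailed benchmark: the normal distribution has excess kurtosis 0
normal_sample = rng.normal(size=n)

# Fat-tailed alternative: Student-t with df=10 has excess kurtosis 6/(df-4) = 1
t_sample = rng.standard_t(df=10, size=n)

k_normal = stats.kurtosis(normal_sample)  # Fisher definition: excess kurtosis
k_t = stats.kurtosis(t_sample)

print(f"Excess kurtosis, normal:        {k_normal:.2f}")
print(f"Excess kurtosis, Student-t(10): {k_t:.2f}")
```

Positive excess kurtosis is exactly the "more mass in the tails than the normal" property measured by the frequency counts above.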
Implications for Trading
- Risk models underestimate tail risk: VaR models assuming normality will blow up
- Stop losses get run more often: 3-std stops should trigger 0.3% of the time but actually trigger 1-2%
- Black swan events are not rare: Plan for 5-10 standard deviation moves (2008, 2020, etc.)
Critical: Never assume returns are normally distributed. Always stress-test strategies for extreme moves (10-20% single-day drops). The Black Monday crash (1987) was a 20-std event under normal distribution - impossible by that model, yet it happened.
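A basic stress test is just a few lines: inject a synthetic crash day into a return stream and re-measure the damage (the -15% shock and the simulated baseline are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)

# Baseline: one simulated year of daily returns
returns = rng.normal(0.0005, 0.01, 252)

def max_drawdown(rets):
    """Largest peak-to-trough equity decline, as a (negative) fraction."""
    equity = np.cumprod(1 + rets)
    peaks = np.maximum.accumulate(equity)
    return ((equity - peaks) / peaks).min()

# Stress scenario: replace one ordinary mid-year day with a -15% crash
stressed = returns.copy()
stressed[126] = -0.15

print(f"Baseline max drawdown: {max_drawdown(returns):.1%}")
print(f"Stressed max drawdown: {max_drawdown(stressed):.1%}")
```

Running the same strategy logic over the stressed series shows whether stops, leverage, and position sizes survive a move the backtest window never contained.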
Summary
Key Takeaways
- Use returns, not prices: Log returns are stationary, additive, and more suitable for analysis
- Stationarity matters: Most statistical techniques require stationarity; returns are approximately stationary
- Autocorrelation is weak but exploitable: Daily returns show little autocorrelation, but longer horizons exhibit momentum
- Volatility clusters: Large moves follow large moves - use this for risk management
- Fat tails are real: Extreme events occur far more often than normal distribution predicts - always plan for tail risk
Next Steps
Now that you understand the statistical foundations of financial time series, the next lesson covers OHLCV data structure: the standard format for financial data and how to work with open, high, low, close, and volume information.