Fetching Market Data with yfinance

Learn to download, validate, and manage financial data using Python and yfinance

25 min read
Beginner

Introduction

Before building trading strategies, you need quality market data. yfinance is a Python library that downloads historical and real-time data from Yahoo Finance for free.

This lesson covers:

  • Installing and using yfinance
  • Downloading historical OHLCV data
  • Multiple tickers and timeframes
  • Handling missing data and errors
  • Data validation best practices

Installing yfinance

Install yfinance using pip:

python
# Install yfinance (the leading ! is Jupyter notebook syntax; in a terminal, just run: pip install yfinance)
!pip install yfinance

# Import required libraries
import yfinance as yf
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

print("yfinance installed successfully!")

Basic Data Download

The simplest way to get data is using yf.download():

python
# Download Apple stock data
# auto_adjust=False keeps both Close and Adj Close; recent yfinance
# versions default to auto_adjust=True, which drops the Adj Close column
aapl = yf.download('AAPL', start='2023-01-01', end='2024-01-01',
                   auto_adjust=False, progress=False)

print("Downloaded data shape:", aapl.shape)
print("\nFirst 5 rows:")
print(aapl.head())

print("\nColumn names:")
print(aapl.columns.tolist())
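One version-dependent wrinkle: newer yfinance releases return a two-level (field, ticker) column index even for a single ticker, which breaks code written for flat columns. A minimal flattening sketch, using a synthetic frame in place of a real download (print your own `data.columns` to confirm what your install returns):

```python
import numpy as np
import pandas as pd

# Synthetic frame mimicking the (field, ticker) MultiIndex columns that
# recent yfinance versions return even for a single ticker
idx = pd.date_range('2023-01-02', periods=3, freq='B')
cols = pd.MultiIndex.from_product([['Open', 'Close'], ['AAPL']])
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=idx, columns=cols)

# Drop the ticker level so columns are plain 'Open'/'Close' again
flat = df.copy()
flat.columns = flat.columns.droplevel(1)

print(flat.columns.tolist())  # ['Open', 'Close']
```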

Understanding Adjusted Close

Adj Close (Adjusted Close) accounts for corporate actions:

  • Stock splits: a 2-for-1 split halves all pre-split prices so the series stays comparable
  • Dividends: each payout reduces the share price on the ex-dividend date
  • Rights offerings: issuing new shares dilutes existing holders

Use Adj Close for returns calculations, not raw Close, to get accurate historical performance.

python
# Compare Close vs Adj Close (requires a download with auto_adjust=False,
# otherwise the Adj Close column is absent)
# Flatten MultiIndex columns if your yfinance version returns them
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.droplevel(1)

comparison = aapl[['Close', 'Adj Close']].copy()
comparison['Difference'] = aapl['Close'] - aapl['Adj Close']
comparison['Diff_Pct'] = (comparison['Difference'] / aapl['Close']) * 100

print("Close vs Adj Close comparison:")
print(comparison.tail())

print(f"\nAverage difference: {comparison['Diff_Pct'].mean():.3f}%")

Always use Adj Close when calculating returns or comparing prices across time. The gap between Close and Adj Close widens with every dividend, so over multi-year horizons raw Close can misstate total return by double-digit percentages.
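To see why the gap matters, a back-of-the-envelope sketch with made-up numbers (a 2% dividend yield, a flat price, ten years; these are illustrative figures, not real AAPL data):

```python
# Illustrative figures only -- not taken from any real ticker
years, div_yield = 10, 0.02
price_return = 0.0  # raw Close captures price moves only: flat price, 0%

# Adj Close folds each payout back in, compounding the yield annually
total_return = (1 + price_return) * (1 + div_yield) ** years - 1
print(f"Price-only return: {price_return:.1%}")
print(f"Total return (dividends reinvested): {total_return:.1%}")  # ~21.9%
```

A strategy evaluated on raw Close would miss that entire 21.9% of performance.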

Downloading Multiple Tickers

Download multiple stocks at once by passing a list:

python
# Download multiple tech stocks
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN']
data = yf.download(tickers, start='2023-01-01', end='2024-01-01', progress=False)

print("Data structure for multiple tickers:")
print("Shape:", data.shape)
print("\nColumns (MultiIndex):")
print(data.columns)

# Access one ticker's close prices via tuple indexing into the MultiIndex
aapl_close = data[('Close', 'AAPL')]
print("\nAAPL Close prices (first 5):")
print(aapl_close.head())

# Compare all tickers' close prices
print("\nAll tickers' Close prices:")
print(data['Close'].tail())

Different Timeframes

yfinance supports multiple timeframes using the interval parameter:

Available Intervals

| Interval | Description     | Max History | Use Case              |
|----------|-----------------|-------------|-----------------------|
| 1m       | 1 minute        | 7 days      | Intraday scalping     |
| 5m       | 5 minutes       | 60 days     | Day trading           |
| 15m      | 15 minutes      | 60 days     | Day trading           |
| 1h       | 1 hour          | 730 days    | Swing trading entries |
| 1d       | 1 day (default) | All history | Position trading      |
| 1wk      | 1 week          | All history | Long-term analysis    |
| 1mo      | 1 month         | All history | Macro trends          |
python
# Download hourly data (last 5 days)
hourly = yf.download('AAPL', period='5d', interval='1h', progress=False)

print("Hourly data:")
print(f"Shape: {hourly.shape}")
print(f"Bars per day: ~{hourly.shape[0] / 5:.0f}")
print("\nFirst few hours:")
print(hourly.head())

# Download weekly data
weekly = yf.download('AAPL', start='2020-01-01', end='2024-01-01', interval='1wk', progress=False)

print(f"\nWeekly data shape: {weekly.shape}")
print("Last 5 weeks:")
print(weekly.tail())
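A second download is not the only way to get weekly bars: if you already hold daily data, pandas can resample it locally. A sketch on synthetic daily OHLCV (the first/max/min/last/sum aggregation map is the standard OHLCV convention, not anything yfinance-specific):

```python
import numpy as np
import pandas as pd

# Synthetic daily OHLCV: two full trading weeks
idx = pd.bdate_range('2023-01-02', periods=10)
daily = pd.DataFrame({
    'Open':   np.linspace(100, 109, 10),
    'High':   np.linspace(101, 110, 10),
    'Low':    np.linspace(99, 108, 10),
    'Close':  np.linspace(100.5, 109.5, 10),
    'Volume': np.full(10, 1_000_000),
}, index=idx)

# Aggregate each Friday-ending week: first open, extreme high/low,
# last close, summed volume
weekly = daily.resample('W-FRI').agg(
    {'Open': 'first', 'High': 'max', 'Low': 'min',
     'Close': 'last', 'Volume': 'sum'}
)
print(weekly)
```

Resampling guarantees the weekly bars are consistent with your daily data, which is useful when comparing signals across timeframes.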

Using Period Shortcuts

Instead of specifying start/end dates, use period shortcuts:

python
# Valid period values: 1d, 5d, 1mo, 3mo, 6mo, 1y, 2y, 5y, 10y, ytd, max

# Last 6 months
data_6mo = yf.download('AAPL', period='6mo', progress=False)
print(f"Last 6 months: {data_6mo.shape[0]} days")

# Year to date
data_ytd = yf.download('AAPL', period='ytd', progress=False)
print(f"Year to date: {data_ytd.shape[0]} days")

# Maximum available history
data_max = yf.download('AAPL', period='max', progress=False)
print(f"All history: {data_max.shape[0]} days")
print(f"First date: {data_max.index[0]}")
print(f"Last date: {data_max.index[-1]}")

Handling Missing Data

Real-world data has gaps (holidays, trading halts, delisting). Always check and handle missing data:

python
# Download data
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)

# Check for missing values
print("Missing values check:")
print(data.isnull().sum())

# Check for any rows with NaN
rows_with_nan = data[data.isnull().any(axis=1)]
print(f"\nRows with NaN: {len(rows_with_nan)}")

# Check data continuity (look for large gaps between consecutive bars)
date_diff = data.index.to_series().diff().dt.days

large_gaps = date_diff[date_diff > 5]
print(f"\nGaps larger than 5 days: {len(large_gaps)}")
if len(large_gaps) > 0:
    print(large_gaps)

# Forward fill missing values (carry the previous bar's values forward)
data_filled = data.ffill()

print("\nData completeness:")
print(f"Rows: {data.shape[0]} (forward-filling keeps the row count unchanged)")
print(f"NaNs remaining after filling: {data_filled.isnull().sum().sum()}")

Important: For less liquid stocks or crypto, missing data is common. Always validate data quality before building strategies. A single missing bar can break your backtest.
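One way to surface such gaps is to reindex against a business-day calendar and see which days come back empty. A sketch on synthetic data; note that plain `bdate_range` knows nothing about exchange holidays, so those will be flagged as missing too:

```python
import numpy as np
import pandas as pd

# Synthetic daily closes with one business day removed to simulate a gap
dates = pd.bdate_range('2023-01-02', '2023-01-13')
prices = pd.Series(np.linspace(100, 110, len(dates)), index=dates)
prices = prices.drop(pd.Timestamp('2023-01-09'))

# Reindex to the full business-day calendar; the gap shows up as NaN
full = prices.reindex(pd.bdate_range(prices.index[0], prices.index[-1]))
missing = full[full.isna()].index

print(f"Missing business days: {[d.date() for d in missing]}")
```

For production pipelines, an exchange-calendar library would give you the true trading sessions instead of the naive Monday-to-Friday grid used here.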

Data Validation Pipeline

Build a robust data validation function:

python
def validate_ohlcv_data(df, ticker=''):
    """
    Validate OHLCV data quality.

    Checks:
    - Missing values
    - OHLC relationships (High >= Low, High >= Close, Low <= Close)
    - Negative prices
    - Zero volume
    - Outliers (price jumps >50%)

    Returns: dict with validation results
    """
    issues = []

    # 1. Check for missing values
    missing = df.isnull().sum().sum()
    if missing > 0:
        issues.append(f"Missing values: {missing}")

    # 2. Check OHLC relationships
    invalid_hl = (df['High'] < df['Low']).sum()
    if invalid_hl > 0:
        issues.append(f"Invalid H<L: {invalid_hl} bars")

    invalid_hc = (df['High'] < df['Close']).sum()
    if invalid_hc > 0:
        issues.append(f"Invalid H<C: {invalid_hc} bars")

    invalid_lc = (df['Low'] > df['Close']).sum()
    if invalid_lc > 0:
        issues.append(f"Invalid L>C: {invalid_lc} bars")

    # 3. Check for negative or zero prices
    negative_prices = (df[['Open', 'High', 'Low', 'Close']] <= 0).any(axis=1).sum()
    if negative_prices > 0:
        issues.append(f"Negative/zero prices: {negative_prices} bars")

    # 4. Check for zero volume
    zero_volume = (df['Volume'] == 0).sum()
    if zero_volume > 0:
        issues.append(f"Zero volume: {zero_volume} bars")

    # 5. Check for extreme jumps (>50% in one day)
    returns = df['Close'].pct_change()
    extreme_moves = (abs(returns) > 0.5).sum()
    if extreme_moves > 0:
        issues.append(f"Extreme moves (>50%): {extreme_moves} bars")

    # Summary
    result = {
        'ticker': ticker,
        'bars': len(df),
        'date_range': f"{df.index[0].date()} to {df.index[-1].date()}",
        'valid': len(issues) == 0,
        'issues': issues
    }

    return result

# Test validation; flatten the column index first, since recent yfinance
# versions return MultiIndex columns even for a single ticker
aapl = yf.download('AAPL', period='1y', progress=False)
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.droplevel(1)
validation = validate_ohlcv_data(aapl, 'AAPL')

print("Data Validation Report:")
print(f"Ticker: {validation['ticker']}")
print(f"Bars: {validation['bars']}")
print(f"Date Range: {validation['date_range']}")
print(f"Valid: {validation['valid']}")
if not validation['valid']:
    print(f"Issues found:")
    for issue in validation['issues']:
        print(f"  - {issue}")
else:
    print("✅ All validations passed!")

Saving and Loading Data

Save downloaded data to avoid repeated downloads:

python
# Download and save
data = yf.download('AAPL', period='max', progress=False)

# Save to CSV
data.to_csv('aapl_historical.csv')
print(f"Saved {len(data)} rows to aapl_historical.csv")

# Save to pickle (faster, preserves dtypes)
data.to_pickle('aapl_historical.pkl')
print(f"Saved to pickle format")

# Load from CSV
loaded_csv = pd.read_csv('aapl_historical.csv', index_col=0, parse_dates=True)
print(f"\nLoaded from CSV: {loaded_csv.shape}")

# Load from pickle
loaded_pkl = pd.read_pickle('aapl_historical.pkl')
print(f"Loaded from pickle: {loaded_pkl.shape}")

# Compare load times (pickle is much faster for large datasets)
import time

start = time.time()
_ = pd.read_csv('aapl_historical.csv', index_col=0, parse_dates=True)
csv_time = time.time() - start

start = time.time()
_ = pd.read_pickle('aapl_historical.pkl')
pkl_time = time.time() - start

print(f"\nLoad time comparison:")
print(f"CSV: {csv_time:.4f} seconds")
print(f"Pickle: {pkl_time:.4f} seconds")
print(f"Speedup: {csv_time/pkl_time:.1f}x")

Pro tip: Use pickle for fast loading during development, but save to CSV for long-term storage and portability (pickle format can change between pandas versions).
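These ideas combine naturally into a small cache helper. This is a sketch with its own conventions (`load_or_fetch` and its fetch-callable argument are this example's names, not part of yfinance); in practice you would pass a lambda wrapping `yf.download`:

```python
import os
import pandas as pd

def load_or_fetch(path, fetch):
    """Return cached data from `path` if it exists; otherwise call
    fetch(), cache the result as a pickle, and return it."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = fetch()
    df.to_pickle(path)
    return df

# Demo with a stand-in fetcher; real usage would look like:
#   load_or_fetch('aapl.pkl', lambda: yf.download('AAPL', period='max'))
fake_fetch = lambda: pd.DataFrame({'Close': [1.0, 2.0]})
df = load_or_fetch('demo_cache.pkl', fake_fetch)
print(len(df))  # 2
```

Delete the pickle file whenever you want to force a fresh download.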

Summary

Key Takeaways

  1. yfinance provides free access to Yahoo Finance data for stocks, ETFs, forex, crypto
  2. Always use adjusted prices for return calculations: download with auto_adjust=False and use Adj Close, or rely on the default auto_adjust=True, where Close is already split- and dividend-adjusted
  3. Multiple timeframes available: 1m to 1mo intervals with varying history limits
  4. Validate data quality: Check for missing values, invalid OHLC relationships, and outliers
  5. Save data locally in pickle format for fast loading during development
  6. Handle missing data using forward-fill or other appropriate methods

Next Steps

You now have the data pipeline skills to download market data. Next, we move to Stage II: Price Action & Market Structure, starting with candlestick analysis: learning to interpret candlesticks as compressed order flow.