Fetching Market Data with yfinance

Learn to download, validate, and manage financial data using Python and yfinance

25 min read
Beginner

Introduction

Before building trading strategies, you need quality market data. yfinance is a Python library that downloads historical and real-time data from Yahoo Finance for free.

This lesson covers:

  • Installing and using yfinance
  • Downloading historical OHLCV data
  • Multiple tickers and timeframes
  • Handling missing data and errors
  • Data validation best practices

Installing yfinance

Install yfinance using pip:

python
# Install yfinance (the leading ! is Jupyter notebook syntax; in a terminal, just run: pip install yfinance)
!pip install yfinance

# Import required libraries
import yfinance as yf
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

print("yfinance installed successfully!")

Basic Data Download

The simplest way to get data is using yf.download():

python
# Download Apple stock data
# auto_adjust=False keeps both Close and Adj Close; recent yfinance
# versions default to auto_adjust=True, which drops the Adj Close column
aapl = yf.download('AAPL', start='2023-01-01', end='2024-01-01',
                   auto_adjust=False, progress=False)

print("Downloaded data shape:", aapl.shape)
print("\nFirst 5 rows:")
print(aapl.head())

print("\nColumn names:")
print(aapl.columns.tolist())
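One version-dependent wrinkle: newer yfinance releases return a two-level (field, ticker) column index even for a single ticker, which breaks code written for flat columns. A minimal flattening sketch, using a synthetic frame in place of a real download (print your own `data.columns` to confirm what your install returns):

```python
import numpy as np
import pandas as pd

# Synthetic frame mimicking the (field, ticker) MultiIndex columns that
# recent yfinance versions return even for a single ticker
idx = pd.date_range('2023-01-02', periods=3, freq='B')
cols = pd.MultiIndex.from_product([['Open', 'Close'], ['AAPL']])
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=idx, columns=cols)

# Drop the ticker level so columns are plain 'Open'/'Close' again
flat = df.copy()
flat.columns = flat.columns.droplevel(1)

print(flat.columns.tolist())  # ['Open', 'Close']
```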

Understanding Adjusted Close

Adj Close (Adjusted Close) accounts for corporate actions:

  • Stock splits: a 2-for-1 split halves all pre-split prices so the series stays comparable
  • Dividends: each payout reduces the share price on the ex-dividend date
  • Rights offerings: issuing new shares dilutes existing holders

Use Adj Close for returns calculations, not raw Close, to get accurate historical performance.

python
# Compare Close vs Adj Close (requires a download with auto_adjust=False,
# otherwise the Adj Close column is absent)
# Flatten MultiIndex columns if your yfinance version returns them
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.droplevel(1)

comparison = aapl[['Close', 'Adj Close']].copy()
comparison['Difference'] = aapl['Close'] - aapl['Adj Close']
comparison['Diff_Pct'] = (comparison['Difference'] / aapl['Close']) * 100

print("Close vs Adj Close comparison:")
print(comparison.tail())

print(f"\nAverage difference: {comparison['Diff_Pct'].mean():.3f}%")

Always use Adj Close when calculating returns or comparing prices across time. The gap between Close and Adj Close widens with every dividend, so over multi-year horizons raw Close can misstate total return by double-digit percentages.
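To see why the gap matters, a back-of-the-envelope sketch with made-up numbers (a 2% dividend yield, a flat price, ten years; these are illustrative figures, not real AAPL data):

```python
# Illustrative figures only -- not taken from any real ticker
years, div_yield = 10, 0.02
price_return = 0.0  # raw Close captures price moves only: flat price, 0%

# Adj Close folds each payout back in, compounding the yield annually
total_return = (1 + price_return) * (1 + div_yield) ** years - 1
print(f"Price-only return: {price_return:.1%}")
print(f"Total return (dividends reinvested): {total_return:.1%}")  # ~21.9%
```

A strategy evaluated on raw Close would miss that entire 21.9% of performance.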

Downloading Multiple Tickers

Download multiple stocks at once by passing a list:

python
# Download multiple tech stocks
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN']
data = yf.download(tickers, start='2023-01-01', end='2024-01-01', progress=False)

print("Data structure for multiple tickers:")
print("Shape:", data.shape)
print("\nColumns (MultiIndex):")
print(data.columns)

# Access one ticker's close prices via tuple indexing into the MultiIndex
aapl_close = data[('Close', 'AAPL')]
print("\nAAPL Close prices (first 5):")
print(aapl_close.head())

# Compare all tickers' close prices
print("\nAll tickers' Close prices:")
print(data['Close'].tail())

Different Timeframes

yfinance supports multiple timeframes using the interval parameter:

Available Intervals

| Interval | Description     | Max History | Use Case              |
|----------|-----------------|-------------|-----------------------|
| 1m       | 1 minute        | 7 days      | Intraday scalping     |
| 5m       | 5 minutes       | 60 days     | Day trading           |
| 15m      | 15 minutes      | 60 days     | Day trading           |
| 1h       | 1 hour          | 730 days    | Swing trading entries |
| 1d       | 1 day (default) | All history | Position trading      |
| 1wk      | 1 week          | All history | Long-term analysis    |
| 1mo      | 1 month         | All history | Macro trends          |
python
# Download hourly data (last 5 days)
hourly = yf.download('AAPL', period='5d', interval='1h', progress=False)

print("Hourly data:")
print(f"Shape: {hourly.shape}")
print(f"Bars per day: ~{hourly.shape[0] / 5:.0f}")
print("\nFirst few hours:")
print(hourly.head())

# Download weekly data
weekly = yf.download('AAPL', start='2020-01-01', end='2024-01-01', interval='1wk', progress=False)

print(f"\nWeekly data shape: {weekly.shape}")
print("Last 5 weeks:")
print(weekly.tail())
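A second download is not the only way to get weekly bars: if you already hold daily data, pandas can resample it locally. A sketch on synthetic daily OHLCV (the first/max/min/last/sum aggregation map is the standard OHLCV convention, not anything yfinance-specific):

```python
import numpy as np
import pandas as pd

# Synthetic daily OHLCV: two full trading weeks
idx = pd.bdate_range('2023-01-02', periods=10)
daily = pd.DataFrame({
    'Open':   np.linspace(100, 109, 10),
    'High':   np.linspace(101, 110, 10),
    'Low':    np.linspace(99, 108, 10),
    'Close':  np.linspace(100.5, 109.5, 10),
    'Volume': np.full(10, 1_000_000),
}, index=idx)

# Aggregate each Friday-ending week: first open, extreme high/low,
# last close, summed volume
weekly = daily.resample('W-FRI').agg(
    {'Open': 'first', 'High': 'max', 'Low': 'min',
     'Close': 'last', 'Volume': 'sum'}
)
print(weekly)
```

Resampling guarantees the weekly bars are consistent with your daily data, which is useful when comparing signals across timeframes.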

Using Period Shortcuts

Instead of specifying start/end dates, use period shortcuts:

python
# Valid period values: 1d, 5d, 1mo, 3mo, 6mo, 1y, 2y, 5y, 10y, ytd, max

# Last 6 months
data_6mo = yf.download('AAPL', period='6mo', progress=False)
print(f"Last 6 months: {data_6mo.shape[0]} days")

# Year to date
data_ytd = yf.download('AAPL', period='ytd', progress=False)
print(f"Year to date: {data_ytd.shape[0]} days")

# Maximum available history
data_max = yf.download('AAPL', period='max', progress=False)
print(f"All history: {data_max.shape[0]} days")
print(f"First date: {data_max.index[0]}")
print(f"Last date: {data_max.index[-1]}")

Handling Missing Data

Real-world data has gaps (holidays, trading halts, delisting). Always check and handle missing data:

python
# Download data
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)

# Check for missing values
print("Missing values check:")
print(data.isnull().sum())

# Check for any rows with NaN
rows_with_nan = data[data.isnull().any(axis=1)]
print(f"\nRows with NaN: {len(rows_with_nan)}")

# Check data continuity (look for large gaps between consecutive bars)
date_diff = data.index.to_series().diff().dt.days

large_gaps = date_diff[date_diff > 5]
print(f"\nGaps larger than 5 days: {len(large_gaps)}")
if len(large_gaps) > 0:
    print(large_gaps)

# Forward fill missing values (carry the previous bar's values forward)
data_filled = data.ffill()

print("\nData completeness:")
print(f"Rows: {data.shape[0]} (forward-filling keeps the row count unchanged)")
print(f"NaNs remaining after filling: {data_filled.isnull().sum().sum()}")

Important: For less liquid stocks or crypto, missing data is common. Always validate data quality before building strategies. A single missing bar can break your backtest.
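One way to surface such gaps is to reindex against a business-day calendar and see which days come back empty. A sketch on synthetic data; note that plain `bdate_range` knows nothing about exchange holidays, so those will be flagged as missing too:

```python
import numpy as np
import pandas as pd

# Synthetic daily closes with one business day removed to simulate a gap
dates = pd.bdate_range('2023-01-02', '2023-01-13')
prices = pd.Series(np.linspace(100, 110, len(dates)), index=dates)
prices = prices.drop(pd.Timestamp('2023-01-09'))

# Reindex to the full business-day calendar; the gap shows up as NaN
full = prices.reindex(pd.bdate_range(prices.index[0], prices.index[-1]))
missing = full[full.isna()].index

print(f"Missing business days: {[d.date() for d in missing]}")
```

For production pipelines, an exchange-calendar library would give you the true trading sessions instead of the naive Monday-to-Friday grid used here.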

Data Validation Pipeline

Build a robust data validation function:

python
def validate_ohlcv_data(df, ticker=''):
    """
    Validate OHLCV data quality.

    Checks:
    - Missing values
    - OHLC relationships (High >= Low, High >= Close, Low <= Close)
    - Negative prices
    - Zero volume
    - Outliers (price jumps >50%)

    Returns: dict with validation results
    """
    issues = []

    # 1. Check for missing values
    missing = df.isnull().sum().sum()
    if missing > 0:
        issues.append(f"Missing values: {missing}")

    # 2. Check OHLC relationships
    invalid_hl = (df['High'] < df['Low']).sum()
    if invalid_hl > 0:
        issues.append(f"Invalid H<L: {invalid_hl} bars")

    invalid_hc = (df['High'] < df['Close']).sum()
    if invalid_hc > 0:
        issues.append(f"Invalid H<C: {invalid_hc} bars")

    invalid_lc = (df['Low'] > df['Close']).sum()
    if invalid_lc > 0:
        issues.append(f"Invalid L>C: {invalid_lc} bars")

    # 3. Check for negative or zero prices
    negative_prices = (df[['Open', 'High', 'Low', 'Close']] <= 0).any(axis=1).sum()
    if negative_prices > 0:
        issues.append(f"Negative/zero prices: {negative_prices} bars")

    # 4. Check for zero volume
    zero_volume = (df['Volume'] == 0).sum()
    if zero_volume > 0:
        issues.append(f"Zero volume: {zero_volume} bars")

    # 5. Check for extreme jumps (>50% in one day)
    returns = df['Close'].pct_change()
    extreme_moves = (abs(returns) > 0.5).sum()
    if extreme_moves > 0:
        issues.append(f"Extreme moves (>50%): {extreme_moves} bars")

    # Summary
    result = {
        'ticker': ticker,
        'bars': len(df),
        'date_range': f"{df.index[0].date()} to {df.index[-1].date()}",
        'valid': len(issues) == 0,
        'issues': issues
    }

    return result

# Test validation; flatten the column index first, since recent yfinance
# versions return MultiIndex columns even for a single ticker
aapl = yf.download('AAPL', period='1y', progress=False)
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.droplevel(1)
validation = validate_ohlcv_data(aapl, 'AAPL')

print("Data Validation Report:")
print(f"Ticker: {validation['ticker']}")
print(f"Bars: {validation['bars']}")
print(f"Date Range: {validation['date_range']}")
print(f"Valid: {validation['valid']}")
if not validation['valid']:
    print(f"Issues found:")
    for issue in validation['issues']:
        print(f"  - {issue}")
else:
    print("✅ All validations passed!")

Saving and Loading Data

Save downloaded data to avoid repeated downloads:

python
# Download and save
data = yf.download('AAPL', period='max', progress=False)

# Save to CSV
data.to_csv('aapl_historical.csv')
print(f"Saved {len(data)} rows to aapl_historical.csv")

# Save to pickle (faster, preserves dtypes)
data.to_pickle('aapl_historical.pkl')
print(f"Saved to pickle format")

# Load from CSV
loaded_csv = pd.read_csv('aapl_historical.csv', index_col=0, parse_dates=True)
print(f"\nLoaded from CSV: {loaded_csv.shape}")

# Load from pickle
loaded_pkl = pd.read_pickle('aapl_historical.pkl')
print(f"Loaded from pickle: {loaded_pkl.shape}")

# Compare load times (pickle is much faster for large datasets)
import time

start = time.time()
_ = pd.read_csv('aapl_historical.csv', index_col=0, parse_dates=True)
csv_time = time.time() - start

start = time.time()
_ = pd.read_pickle('aapl_historical.pkl')
pkl_time = time.time() - start

print(f"\nLoad time comparison:")
print(f"CSV: {csv_time:.4f} seconds")
print(f"Pickle: {pkl_time:.4f} seconds")
print(f"Speedup: {csv_time/pkl_time:.1f}x")

Pro tip: Use pickle for fast loading during development, but save to CSV for long-term storage and portability (pickle format can change between pandas versions).
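These ideas combine naturally into a small cache helper. This is a sketch with its own conventions (`load_or_fetch` and its fetch-callable argument are this example's names, not part of yfinance); in practice you would pass a lambda wrapping `yf.download`:

```python
import os
import pandas as pd

def load_or_fetch(path, fetch):
    """Return cached data from `path` if it exists; otherwise call
    fetch(), cache the result as a pickle, and return it."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = fetch()
    df.to_pickle(path)
    return df

# Demo with a stand-in fetcher; real usage would look like:
#   load_or_fetch('aapl.pkl', lambda: yf.download('AAPL', period='max'))
fake_fetch = lambda: pd.DataFrame({'Close': [1.0, 2.0]})
df = load_or_fetch('demo_cache.pkl', fake_fetch)
print(len(df))  # 2
```

Delete the pickle file whenever you want to force a fresh download.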

Summary

Key Takeaways

  1. yfinance provides free access to Yahoo Finance data for stocks, ETFs, forex, crypto
  2. Always use adjusted prices for return calculations: download with auto_adjust=False and use Adj Close, or rely on the default auto_adjust=True, where Close is already split- and dividend-adjusted
  3. Multiple timeframes available: 1m to 1mo intervals with varying history limits
  4. Validate data quality: Check for missing values, invalid OHLC relationships, and outliers
  5. Save data locally in pickle format for fast loading during development
  6. Handle missing data using forward-fill or other appropriate methods

Next Steps

You now have the data pipeline skills to download market data. Next, we move to Stage II: Price Action & Market Structure, starting with candlestick analysis: learning to interpret candlesticks as compressed order flow.