Fetching Market Data with yfinance
Learn to download, validate, and manage financial data using Python and yfinance
Introduction
Before building trading strategies, you need quality market data. yfinance is a free Python library that downloads historical (and delayed intraday) market data from Yahoo Finance.
This lesson covers:
- Installing and using yfinance
- Downloading historical OHLCV data
- Multiple tickers and timeframes
- Handling missing data and errors
- Data validation best practices
Installing yfinance
Install yfinance using pip:
# Install yfinance
!pip install yfinance
# Import required libraries
import yfinance as yf
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
print("yfinance installed successfully!")
Basic Data Download
The simplest way to get data is using yf.download():
# Download Apple stock data (auto_adjust=False keeps both Close and Adj Close;
# newer yfinance versions adjust prices and drop 'Adj Close' by default)
aapl = yf.download('AAPL', start='2023-01-01', end='2024-01-01',
                   auto_adjust=False, progress=False)
# Newer yfinance versions return MultiIndex columns even for a single ticker;
# flatten to plain column names for the examples below
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.get_level_values(0)
print("Downloaded data shape:", aapl.shape)
print("\nFirst 5 rows:")
print(aapl.head())
print("\nColumn names:")
print(aapl.columns.tolist())
Understanding Adjusted Close
Adj Close (Adjusted Close) accounts for corporate actions:
- Stock splits: a 2-for-1 split halves every pre-split price
- Dividends: the share price drops by the dividend amount on the ex-dividend date
- Rights offerings: new shares issued at a discount dilute existing holders
Use Adj Close for returns calculations, not raw Close, to get accurate historical performance.
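To see why this matters, here is a minimal synthetic example (the prices and split date are made up, not real AAPL data) of how a 2-for-1 split distorts returns computed from the raw Close:

```python
import pandas as pd

# Hypothetical 4-day series with a 2-for-1 split effective on day 3:
# the company's value is unchanged, but the raw close drops by half.
raw_close = pd.Series([100.0, 102.0, 51.0, 52.0],
                      index=pd.date_range('2023-06-01', periods=4))

# Adjusted close rescales pre-split prices by the split factor (0.5)
# so the series is comparable across the split date.
split_factor = pd.Series([0.5, 0.5, 1.0, 1.0], index=raw_close.index)
adj_close = raw_close * split_factor

raw_return = raw_close.pct_change()
adj_return = adj_close.pct_change()

print(raw_return)  # shows a bogus -50% "crash" on the split date
print(adj_return)  # shows the true ~0% move on the split date
```

The raw series reports a -50% return on the split date even though nothing happened to the investment; the adjusted series reports 0%.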
# Compare Close vs Adj Close
comparison = aapl[['Close', 'Adj Close']].copy()
comparison['Difference'] = aapl['Close'] - aapl['Adj Close']
comparison['Diff_Pct'] = (comparison['Difference'] / aapl['Close']) * 100
print("Close vs Adj Close comparison:")
print(comparison.tail())
print(f"\nAverage difference: {comparison['Diff_Pct'].mean():.3f}%")
Always use Adj Close when calculating returns or comparing prices across time. The difference compounds: a 0.42% per-period gap grows to more than 10% over a span of years.
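The compounding claim is easy to check directly (0.42% here is just a hypothetical per-period gap, not a measured value):

```python
# A small per-period gap between raw and adjusted returns compounds:
# 0.42% per quarter over 24 quarters (6 years) exceeds 10% cumulative.
gap_per_period = 0.0042
periods = 24
cumulative_gap = (1 + gap_per_period) ** periods - 1
print(f"Cumulative drift after {periods} periods: {cumulative_gap:.1%}")
```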
Downloading Multiple Tickers
Download multiple stocks at once by passing a list:
# Download multiple tech stocks
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN']
data = yf.download(tickers, start='2023-01-01', end='2024-01-01', progress=False)
print("Data structure for multiple tickers:")
print("Shape:", data.shape)
print("\nColumns (MultiIndex):")
print(data.columns)
# Access specific ticker's close prices
aapl_close = data['Close']['AAPL']
print("\nAAPL Close prices (first 5):")
print(aapl_close.head())
# Compare all tickers' close prices
print("\nAll tickers' Close prices:")
print(data['Close'].tail())
Different Timeframes
yfinance supports multiple timeframes using the interval parameter:
| Interval | Description | Max History | Use Case |
|---|---|---|---|
| 1m | 1 minute | 7 days | Intraday scalping |
| 5m | 5 minutes | 60 days | Day trading |
| 15m | 15 minutes | 60 days | Day trading |
| 1h | 1 hour | 730 days | Swing trading entries |
| 1d | 1 day (default) | All history | Position trading |
| 1wk | 1 week | All history | Long-term analysis |
| 1mo | 1 month | All history | Macro trends |
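If you already have daily bars, coarser timeframes can also be built locally by resampling rather than re-downloading. A sketch with synthetic data (all values here are illustrative); the key point is that each OHLCV column needs its own aggregation rule:

```python
import numpy as np
import pandas as pd

# Synthetic daily OHLCV bars covering two business weeks
idx = pd.bdate_range('2023-01-02', periods=10)
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0, 1, len(idx)))
daily = pd.DataFrame({
    'Open': close + rng.normal(0, 0.2, len(idx)),
    'High': close + 1.0,
    'Low': close - 1.0,
    'Close': close,
    'Volume': rng.integers(1_000, 5_000, len(idx)),
}, index=idx)

# Aggregate daily bars into weekly bars (week labeled by its Friday)
weekly = daily.resample('W-FRI').agg({
    'Open': 'first',   # week's first open
    'High': 'max',     # week's highest high
    'Low': 'min',      # week's lowest low
    'Close': 'last',   # week's last close
    'Volume': 'sum',   # total volume traded
})
print(weekly)
```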
# Download hourly data (last 5 days)
hourly = yf.download('AAPL', period='5d', interval='1h', progress=False)
print("Hourly data:")
print(f"Shape: {hourly.shape}")
print(f"Bars per day: ~{hourly.shape[0] / 5:.0f}")
print("\nFirst few hours:")
print(hourly.head())
# Download weekly data
weekly = yf.download('AAPL', start='2020-01-01', end='2024-01-01', interval='1wk', progress=False)
print(f"\nWeekly data shape: {weekly.shape}")
print("Last 5 weeks:")
print(weekly.tail())
Using Period Shortcuts
Instead of specifying start/end dates, use period shortcuts:
# Valid period values: 1d, 5d, 1mo, 3mo, 6mo, 1y, 2y, 5y, 10y, ytd, max
# Last 6 months
data_6mo = yf.download('AAPL', period='6mo', progress=False)
print(f"Last 6 months: {data_6mo.shape[0]} days")
# Year to date
data_ytd = yf.download('AAPL', period='ytd', progress=False)
print(f"Year to date: {data_ytd.shape[0]} days")
# Maximum available history
data_max = yf.download('AAPL', period='max', progress=False)
print(f"All history: {data_max.shape[0]} days")
print(f"First date: {data_max.index[0]}")
print(f"Last date: {data_max.index[-1]}")
Handling Missing Data
Real-world data has gaps (holidays, trading halts, delisting). Always check and handle missing data:
# Download data
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)
# Check for missing values
print("Missing values check:")
print(data.isnull().sum())
# Check for any rows with NaN
rows_with_nan = data[data.isnull().any(axis=1)]
print(f"\nRows with NaN: {len(rows_with_nan)}")
# Check data continuity (look for large gaps between consecutive bars)
date_diff = data.index.to_series().diff().dt.days
large_gaps = date_diff[date_diff > 5]
print(f"\nGaps larger than 5 days: {len(large_gaps)}")
if len(large_gaps) > 0:
    print(large_gaps)
# Forward fill missing data (carry the previous bar's values forward)
data_filled = data.ffill()
print("\nData completeness:")
print(f"Original: {data.shape[0]} rows")
print(f"After filling: {data_filled.shape[0]} rows")
Important: For less liquid stocks or crypto, missing data is common. Always validate data quality before building strategies. A single missing bar can break your backtest.
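Note that forward-filling only replaces NaN values in existing rows; a bar that is missing entirely leaves no row to fill. One way to detect and repair dropped bars is to reindex to the expected trading calendar first. A sketch with synthetic prices (dates and values are made up):

```python
import pandas as pd

# Synthetic daily closes with one bar missing entirely (2023-03-08 dropped)
idx = pd.to_datetime(['2023-03-06', '2023-03-07', '2023-03-09', '2023-03-10'])
prices = pd.DataFrame({'Close': [100.0, 101.0, 103.0, 102.0]}, index=idx)

# Reindex to the full business-day calendar to expose the gap as NaN...
full_idx = pd.bdate_range(prices.index.min(), prices.index.max())
reindexed = prices.reindex(full_idx)
print("Missing bars:", reindexed['Close'].isna().sum())  # 1

# ...then forward-fill so the missing bar carries the prior close
filled = reindexed.ffill()
print(filled)
```

This approach assumes a standard Monday-to-Friday calendar; for exchanges with different holidays you would build `full_idx` from the actual trading calendar instead.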
Data Validation Pipeline
Build a robust data validation function:
def validate_ohlcv_data(df, ticker=''):
    """
    Validate OHLCV data quality.

    Checks:
    - Missing values
    - OHLC relationships (H >= L, H >= O/C, L <= O/C)
    - Negative or zero prices
    - Zero volume
    - Outliers (price jumps > 50%)

    Returns: dict with validation results
    """
    issues = []

    # 1. Check for missing values
    missing = df.isnull().sum().sum()
    if missing > 0:
        issues.append(f"Missing values: {missing}")

    # 2. Check OHLC relationships
    invalid_hl = (df['High'] < df['Low']).sum()
    if invalid_hl > 0:
        issues.append(f"Invalid H<L: {invalid_hl} bars")
    invalid_hoc = (df['High'] < df[['Open', 'Close']].max(axis=1)).sum()
    if invalid_hoc > 0:
        issues.append(f"Invalid H<O/C: {invalid_hoc} bars")
    invalid_loc = (df['Low'] > df[['Open', 'Close']].min(axis=1)).sum()
    if invalid_loc > 0:
        issues.append(f"Invalid L>O/C: {invalid_loc} bars")

    # 3. Check for negative or zero prices
    negative_prices = (df[['Open', 'High', 'Low', 'Close']] <= 0).any(axis=1).sum()
    if negative_prices > 0:
        issues.append(f"Negative/zero prices: {negative_prices} bars")

    # 4. Check for zero volume
    zero_volume = (df['Volume'] == 0).sum()
    if zero_volume > 0:
        issues.append(f"Zero volume: {zero_volume} bars")

    # 5. Check for extreme jumps (>50% in one bar)
    returns = df['Close'].pct_change()
    extreme_moves = (returns.abs() > 0.5).sum()
    if extreme_moves > 0:
        issues.append(f"Extreme moves (>50%): {extreme_moves} bars")

    # Summary
    return {
        'ticker': ticker,
        'bars': len(df),
        'date_range': f"{df.index[0].date()} to {df.index[-1].date()}",
        'valid': len(issues) == 0,
        'issues': issues
    }

# Test validation
aapl = yf.download('AAPL', period='1y', progress=False)
# Newer yfinance versions return MultiIndex columns even for one ticker;
# flatten them so the validator sees plain 'Open'/'High'/... names
if isinstance(aapl.columns, pd.MultiIndex):
    aapl.columns = aapl.columns.get_level_values(0)
validation = validate_ohlcv_data(aapl, 'AAPL')
print("Data Validation Report:")
print(f"Ticker: {validation['ticker']}")
print(f"Bars: {validation['bars']}")
print(f"Date Range: {validation['date_range']}")
print(f"Valid: {validation['valid']}")
if not validation['valid']:
    print("Issues found:")
    for issue in validation['issues']:
        print(f"  - {issue}")
else:
    print("✅ All validations passed!")
Saving and Loading Data
Save downloaded data to avoid repeated downloads:
# Download and save
data = yf.download('AAPL', period='max', progress=False)
# Save to CSV
data.to_csv('aapl_historical.csv')
print(f"Saved {len(data)} rows to aapl_historical.csv")
# Save to pickle (faster, preserves dtypes)
data.to_pickle('aapl_historical.pkl')
print("Saved to pickle format")
# Load from CSV
loaded_csv = pd.read_csv('aapl_historical.csv', index_col=0, parse_dates=True)
print(f"\nLoaded from CSV: {loaded_csv.shape}")
# Load from pickle
loaded_pkl = pd.read_pickle('aapl_historical.pkl')
print(f"Loaded from pickle: {loaded_pkl.shape}")
# Compare load times (pickle is much faster for large datasets)
import time
start = time.time()
_ = pd.read_csv('aapl_historical.csv', index_col=0, parse_dates=True)
csv_time = time.time() - start
start = time.time()
_ = pd.read_pickle('aapl_historical.pkl')
pkl_time = time.time() - start
print(f"\nLoad time comparison:")
print(f"CSV: {csv_time:.4f} seconds")
print(f"Pickle: {pkl_time:.4f} seconds")
print(f"Speedup: {csv_time/pkl_time:.1f}x")
Pro tip: Use pickle for fast loading during development, but save to CSV for long-term storage and portability (pickle format can change between pandas versions).
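To avoid re-downloading during development, the save/load steps can be combined into a small cache wrapper. This is a sketch, not part of yfinance: `load_or_fetch` is a hypothetical helper, and the fetch function is injected so it works with any data source.

```python
import os
import pandas as pd

def load_or_fetch(path, fetch):
    """Load a cached DataFrame from `path` if it exists;
    otherwise call `fetch()`, cache the result, and return it."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = fetch()
    df.to_pickle(path)
    return df

# Intended use with yfinance (hits the network only on the first call):
# data = load_or_fetch('aapl_max.pkl',
#                      lambda: yf.download('AAPL', period='max', progress=False))
```

Passing the fetcher as a function keeps the caching logic testable without network access.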
Summary
Key Takeaways
- yfinance provides free access to Yahoo Finance data for stocks, ETFs, forex, crypto
- Always use Adj Close for return calculations to account for splits and dividends
- Multiple timeframes available: 1m to 1mo intervals with varying history limits
- Validate data quality: Check for missing values, invalid OHLC relationships, and outliers
- Save data locally in pickle format for fast loading during development
- Handle missing data using forward-fill or other appropriate methods
Next Steps
You now have the data pipeline skills to download market data. Next, we move to Stage II: Price Action & Market Structure, starting with candlestick analysis: learning to interpret candlesticks as compressed order flow.