Bootstrapping
Learn the bootstrap method: resampling with replacement to estimate standard errors and confidence intervals for any statistic without formulas.
Statistics Without Formulas
Imagine you have a small sample and want a confidence interval, but:
- The distribution isn't normal
- You don't know which formula to use
- The statistic is weird (median, correlation, ratio)
Bootstrapping says: use the data to simulate what would happen if you could resample from the population. It's one of the most powerful techniques in modern statistics.
The Bootstrap Idea
Resample with replacement from your observed data to estimate the sampling distribution of a statistic.
The key insight: your sample is the best estimate of the population you have. So treat your sample as if it were the population and resample from it.
You have n = 10 observations: [5, 7, 3, 9, 6, 8, 7, 5, 6, 4]
Want to estimate the standard error of the median.
Bootstrap procedure:
- Resample 10 values with replacement → [7, 5, 7, 9, 5, 5, 8, 6, 7, 3]
- Calculate median of this bootstrap sample → 6.5
- Repeat 1,000 times → 1,000 bootstrap medians
- Standard deviation of those 1,000 medians ≈ standard error of the median
Why it works: By resampling with replacement, you're simulating what other samples from the population might look like.
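The procedure above can be sketched in a few lines of Python using only the standard library. A minimal sketch; the exact SE value depends on the random seed and number of resamples:

```python
import random
import statistics

# The 10 observations from the example above
data = [5, 7, 3, 9, 6, 8, 7, 5, 6, 4]

def bootstrap_se(sample, stat, n_boot=1000, seed=0):
    """Estimate the standard error of `stat` by bootstrap resampling."""
    rng = random.Random(seed)
    boot_stats = []
    for _ in range(n_boot):
        # Draw n values WITH replacement from the original sample
        resample = rng.choices(sample, k=len(sample))
        boot_stats.append(stat(resample))
    # The SD of the bootstrap statistics approximates the standard error
    return statistics.stdev(boot_stats)

se_median = bootstrap_se(data, statistics.median)
print(round(se_median, 2))
```

Swap `statistics.median` for any other function of the sample and the same code estimates that statistic's standard error.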
Bootstrap Confidence Intervals
The percentile method is the simplest bootstrap CI:
- Generate 10,000 bootstrap samples
- Calculate your statistic for each
- Take the 2.5th and 97.5th percentiles of the bootstrap distribution
- These form your 95% confidence interval
Original sample median: 6
Bootstrap medians (10,000 resamples): [4.5, 5.0, 5.5, 6.0, 6.0, 6.5, 7.0, 7.5, 8.0, ...]
Sort them and find:
- 2.5th percentile: 5.1
- 97.5th percentile: 7.4
95% Bootstrap CI: [5.1, 7.4]
No formula needed! The data and resampling did the work.
Beautiful property: This works for ANY statistic — median, correlation, ratio, standard deviation, whatever. No need to derive formulas.
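The percentile method translates almost line-for-line into code. A sketch, again standard library only; the endpoints it prints will differ from the illustrative 5.1 and 7.4 above, since those depend on the particular resamples:

```python
import random
import statistics

data = [5, 7, 3, 9, 6, 8, 7, 5, 6, 4]

def percentile_ci(sample, stat, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-method bootstrap confidence interval for `stat`."""
    rng = random.Random(seed)
    # Compute the statistic on each bootstrap resample, then sort
    boot = sorted(stat(rng.choices(sample, k=len(sample)))
                  for _ in range(n_boot))
    # The alpha/2 and 1 - alpha/2 percentiles bound the interval
    lo = boot[int(n_boot * alpha / 2)]
    hi = boot[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = percentile_ci(data, statistics.median)
print(lo, hi)
```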
Why With Replacement?
With replacement means after picking a value, you put it back. The same observation can appear multiple times in a bootstrap sample.
Why? Without replacement, you'd just get your original sample reordered — no new information.
Sampling with replacement creates variability that mimics drawing fresh samples from a population.
Sample: [3, 5, 7]
Possible bootstrap samples WITH replacement:
- [3, 3, 3]
- [3, 5, 7] (original)
- [7, 7, 5]
- [3, 7, 3]
- ... 27 ordered samples (3 × 3 × 3) in total!
Possible bootstrap samples WITHOUT replacement:
- [3, 5, 7]
- [3, 7, 5]
- [5, 3, 7]
- Only 6 permutations — not enough variability!
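The contrast maps directly onto Python's standard library: `random.choices` draws with replacement, `random.sample` without:

```python
import random

sample = [3, 5, 7]
rng = random.Random(42)

# WITH replacement: duplicates possible, 3**3 = 27 ordered outcomes
with_repl = rng.choices(sample, k=3)

# WITHOUT replacement: always the same three values, just reordered
without_repl = rng.sample(sample, k=3)

print(with_repl)             # may contain repeats
print(sorted(without_repl))  # always [3, 5, 7]
```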
When to Use Bootstrap
Use bootstrapping when:
1. No formula exists. Want a SE or CI for the median, trimmed mean, correlation, or some custom statistic? Bootstrap handles it.
2. Complex statistics. Regression diagnostics, model performance metrics: bootstrap works where theory is hard.
3. Small to moderate samples. Parametric assumptions are questionable but the sample is too small to lean on the CLT. Bootstrap provides valid inference.
4. Non-normal data. The distribution is skewed or has outliers; bootstrap doesn't care about the shape.
Don't use bootstrap when:
- Sample is truly tiny (n < 10) — not enough data to resample from
- Data has clear structure you're ignoring (time series, clusters) — need specialized bootstrap
- Parametric methods work fine — bootstrap is computationally expensive
How Many Bootstrap Samples?
For standard errors: B = 1,000 is usually enough
For confidence intervals: B = 5,000-10,000 recommended
For very precise work: B = 50,000+
More is better, but returns diminish. The key point: the Monte Carlo noise from resampling should be small compared to the sampling variability in your original data.
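A quick way to see the diminishing returns is to recompute the bootstrap SE of the median at several values of B: estimates jitter at small B and settle down as B grows. A rough sketch; exact numbers depend on the seed:

```python
import random
import statistics

data = [5, 7, 3, 9, 6, 8, 7, 5, 6, 4]
rng = random.Random(1)

def boot_se(n_boot):
    """Bootstrap SE of the median using n_boot resamples."""
    meds = [statistics.median(rng.choices(data, k=len(data)))
            for _ in range(n_boot)]
    return statistics.stdev(meds)

# Repeated runs at B = 100 vary more than runs at B = 10,000
for B in (100, 1_000, 10_000):
    print(B, round(boot_se(B), 3))
```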
Modern computing makes this trivial. What would have been impossible 30 years ago runs in seconds today. The bootstrap democratized statistics.
Bootstrap vs Theory
Theoretical approach (traditional):
- Assume distribution (normal, etc.)
- Derive formula for SE/CI
- Apply formula
Pros: Fast, elegant, exact if assumptions hold
Cons: Requires assumptions, limited to specific statistics
Bootstrap approach (modern):
- Make minimal assumptions
- Let the computer resample
- Empirically estimate SE/CI
Pros: Works for any statistic, fewer assumptions, intuitive
Cons: Computationally intensive, still needs reasonable sample size
Reality: Use both! If they agree, great. If they disagree, bootstrap is more robust.
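For the mean, both routes are available, so they can be checked against each other: the textbook formula s/√n versus the bootstrap estimate. Run on the earlier sample, the two should land close together (a sketch; small differences come from Monte Carlo noise and the n vs. n−1 divisor):

```python
import math
import random
import statistics

data = [5, 7, 3, 9, 6, 8, 7, 5, 6, 4]
n = len(data)

# Theoretical SE of the mean: sample SD divided by sqrt(n)
se_formula = statistics.stdev(data) / math.sqrt(n)

# Bootstrap SE of the mean: SD of many resampled means
rng = random.Random(0)
boot_means = [statistics.fmean(rng.choices(data, k=n))
              for _ in range(10_000)]
se_boot = statistics.stdev(boot_means)

print(round(se_formula, 3), round(se_boot, 3))
```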
Limitations
1. Bootstrap can't create information that isn't there. If your sample is biased or unrepresentative, bootstrap won't fix it. Garbage in, garbage out.
2. Rare events. If something has probability 1%, you typically need more than 100 observations just to see it once. Bootstrap can't estimate what it's never seen.
3. Not magic. Bootstrap assumes your sample is representative. It estimates sampling variability, not bias.
4. Dependent data requires specialized methods. Time series and clustered data need a block bootstrap or cluster bootstrap, not naive resampling.
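As a sketch of the block idea for time series: resample contiguous blocks rather than individual points, so short-range dependence within each block is preserved. The `moving_block_bootstrap` function and the `block_len` value here are illustrative assumptions, not a production implementation; choosing the block length well matters in practice:

```python
import random

def moving_block_bootstrap(series, block_len, seed=0):
    """One bootstrap resample of a series using overlapping moving blocks."""
    rng = random.Random(seed)
    n = len(series)
    # All overlapping blocks of length block_len
    blocks = [series[i:i + block_len] for i in range(n - block_len + 1)]
    out = []
    # Glue randomly chosen blocks together until the series length is reached
    while len(out) < n:
        out.extend(rng.choice(blocks))
    return out[:n]

series = [1, 2, 3, 5, 8, 13, 21, 34]
print(moving_block_bootstrap(series, block_len=3))
```

Each resample is then fed to your statistic exactly as in the ordinary bootstrap; only the resampling step changes.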