Capstone Projects

Apply everything you've learned: analyze real datasets, design experiments, debunk misleading claims, and build statistical reports.

Putting It All Together

You've learned the theory. Now it's time to apply it.

This capstone lesson presents real-world statistical challenges where you'll integrate everything you've learned:

  • Descriptive statistics and visualization
  • Probability and distributions
  • Hypothesis testing and confidence intervals
  • Correlation, regression, and causation
  • Critical thinking and fallacy detection

Choose one or more projects below. Work through them systematically, documenting your reasoning.

What I cannot create, I do not understand.
Richard Feynman

Project 1: Analyze a Real Dataset

Goal: Explore a dataset, summarize your findings, and draw evidence-based conclusions.

Data sources:

  • Kaggle datasets (kaggle.com/datasets)
  • UCI Machine Learning Repository
  • data.gov (government data)
  • FiveThirtyEight data (github.com/fivethirtyeight/data)
  • Your own data (work, hobby, research)

Required steps:

1. Data Exploration

  • How many observations? Variables?
  • What types of data? (categorical, numerical, time series)
  • Missing values? Outliers?
  • Create histograms, box plots, scatter plots

2. Descriptive Statistics

  • Central tendency: mean, median, mode
  • Spread: standard deviation, IQR
  • Check distribution shape: normal? skewed?
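The summary measures in step 2 can be sketched with Python's standard library alone. This is a minimal illustration, not a prescribed workflow: the `describe` helper and the `page_views` sample are hypothetical, and pandas or NumPy would do the same job on a real dataset.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "iqr": q3 - q1,
    }

# Hypothetical sample: daily page views with one large outlier
page_views = [12, 15, 15, 18, 20, 22, 25, 30, 95]
summary = describe(page_views)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean well above the median, as in this made-up sample, is a quick hint that the distribution is right-skewed or contains outliers.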

3. Research Question

  • Formulate a specific, testable question
  • Example: "Do SAT scores differ by geographic region?"

4. Statistical Analysis

  • Choose appropriate test (t-test, correlation, regression, chi-square)
  • Check assumptions (normality, independence, sample size)
  • Calculate test statistic and p-value
  • Construct confidence intervals
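As one possible way to get a test statistic and p-value for a two-group question, here is a sketch of a permutation test for a difference in means. It is a substitute for the t-test listed above, chosen because it needs few distributional assumptions and runs on the standard library. The function name and the score data are hypothetical.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under H0 the group assignment is arbitrary
        rng.shuffle(pooled)
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical test scores from two regions
region_a = [510, 495, 530, 480, 505, 520, 490, 515, 500, 525]
region_b = [540, 525, 555, 510, 535, 560, 520, 545, 530, 550]
diff, p = permutation_test(region_a, region_b)
print(f"difference in means: {diff:.1f}, p-value: {p:.4f}")
```

The independence assumption still applies: shuffling labels only makes sense if observations are exchangeable under the null hypothesis.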

5. Interpretation

  • What do the results mean in plain English?
  • Statistical vs practical significance?
  • Limitations and confounders?
  • Alternative explanations?

6. Visualization

  • Create publication-quality graphs
  • Tell a story with data
  • Avoid misleading visualizations

7. Write-up

  • Introduction (question and why it matters)
  • Methods (data source, sample size, tests used)
  • Results (numbers, tables, figures)
  • Discussion (interpretation, limitations, future directions)

Bonus: Share your analysis as a blog post, report, or presentation. Explaining your work to others solidifies your own understanding.

Project 2: Design and Analyze an A/B Test

Goal: Design an experiment, collect data (or simulate), and analyze results.

Scenario: You run a website and want to increase signups.

Steps:

1. Hypothesis

  • "Changing the signup button from blue to green will increase signup rate."

2. Design

  • Control: Blue button
  • Treatment: Green button
  • Metric: Signup rate (%)
  • Randomization: 50/50 split of visitors

3. Sample Size Calculation

  • Current signup rate: 5%
  • Minimum detectable effect: +1 percentage point (to 6%)
  • Significance level: α = 0.05
  • Power: 0.80
  • Calculate the required sample size per group using a formula or an online calculator
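A minimal sketch of the sample size calculation for these inputs, using one common normal-approximation formula for comparing two proportions. The function name is hypothetical, and online calculators may use slightly different variants of the formula, so expect small differences.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Values from the scenario above: 5% baseline, 6% target
n_per_group = sample_size_two_proportions(0.05, 0.06)
print(f"Visitors needed per group: {n_per_group}")
```

With these inputs the answer is on the order of 8,000 visitors per group, which shows why detecting a one-point lift on a low baseline rate takes substantial traffic.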

4. Data Collection (can simulate)

  • Generate simulated data with realistic effects
  • Include some noise and variability
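If you simulate rather than collect data, one simple approach is to treat each visitor as an independent coin flip with the group's signup probability. The sketch below assumes exactly that; the function name, group size, and rates are illustrative choices, and real traffic has complications (repeat visitors, day-of-week effects) that this ignores.

```python
import random

def simulate_ab_test(n_per_group, control_rate, treatment_rate, seed=42):
    """Simulate signup counts for a control and a treatment group."""
    rng = random.Random(seed)  # fixed seed so the simulation is reproducible
    # Each visitor signs up independently with the group's true rate
    control_signups = sum(rng.random() < control_rate for _ in range(n_per_group))
    treatment_signups = sum(rng.random() < treatment_rate for _ in range(n_per_group))
    return control_signups, treatment_signups

control, treatment = simulate_ab_test(8000, control_rate=0.05, treatment_rate=0.06)
print(f"control:   {control} signups ({control / 8000:.2%})")
print(f"treatment: {treatment} signups ({treatment / 8000:.2%})")
```

Rerunning with different seeds shows how much the observed rates wander around the true 5% and 6% purely from sampling noise.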

5. Analysis

  • Calculate signup rates for both groups
  • Two-proportion Z-test
  • Confidence interval for difference
  • Effect size and practical significance
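The analysis steps above can be sketched as a two-proportion z-test with a confidence interval for the difference. The counts below are hypothetical, and the function name is an assumption for illustration; libraries such as statsmodels offer equivalent tests.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x_control, n_control, x_treatment, n_treatment, confidence=0.95):
    """Two-sided z-test and confidence interval for a difference in proportions."""
    p_c = x_control / n_control
    p_t = x_treatment / n_treatment
    diff = p_t - p_c
    # Pooled standard error: assumes H0 (equal rates) for the test statistic
    p_pool = (x_control + x_treatment) / (n_control + n_treatment)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error: used for the interval around the observed difference
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci

# Hypothetical results: 400 of 8000 (blue) vs 480 of 8000 (green)
z, p_value, ci = two_proportion_ztest(400, 8000, 480, 8000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI for the difference: {ci[0]:.4f} to {ci[1]:.4f}")
```

The interval is often more useful to stakeholders than the p-value, because it shows the range of lifts that are compatible with the data.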

6. Conclusion

  • Is the result statistically significant?
  • Is the improvement worth implementing?
  • Cost-benefit analysis

7. Report

  • Present your findings as if to real stakeholders
  • Recommendation: Roll out green button? More testing needed?

Extensions:

  • Multivariate test (test multiple changes simultaneously)
  • Sequential testing (analyze as data arrives, using stopping rules that keep the false positive rate under control)
  • Account for multiple testing if running many tests
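To see why multiple testing matters, here is a small sketch that computes the family-wise error rate for independent tests and applies a Bonferroni correction. Bonferroni is only one of several possible corrections, and the p-values below are made up for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values stay significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)  # each test gets an equal share of alpha
    return [p <= threshold for p in p_values]

# Hypothetical p-values from five simultaneous button/headline tests
p_values = [0.003, 0.02, 0.04, 0.30, 0.75]
fwer = familywise_error_rate(0.05, len(p_values))
flags = bonferroni_significant(p_values)
print(f"Uncorrected family-wise error rate: {fwer:.1%}")
print(f"Significant after Bonferroni: {flags}")
```

Three of these results would pass an uncorrected 0.05 cutoff, but only the first survives the corrected threshold of 0.01.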

Project 3: Debunk a Misleading Claim

Goal: Find a statistical claim in the wild and critically evaluate it.

Sources:

  • News articles
  • Social media
  • Advertisements
  • Political claims
  • Health/wellness products

Required analysis:

1. Identify the Claim

  • "Drinking green tea burns 500 extra calories per day!"

2. Find the Source

  • Original study? Or just marketing?
  • Sample size? Study design?
  • Peer-reviewed? Replicated?

3. Critical Questions

  • Is this correlation or causation?
  • What's the baseline / control group?
  • Relative vs absolute effect?
  • Cherry picking? Publication bias?
  • Funding source / conflicts of interest?
  • Sample representative?

4. Alternative Explanations

  • Confounding variables?
  • Reverse causation?
  • Measurement error?
  • Regression to the mean?

5. Calculate True Effect

  • If they report relative risk, find absolute risk
  • If they say "statistically significant," find effect size
  • Compare claimed effect to plausible reality
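The gap between relative and absolute effects is easy to show with arithmetic. The sketch below uses invented counts (not from any real study) to convert a headline "cuts risk in half" into absolute terms; the function name is hypothetical.

```python
def risk_summary(events_control, n_control, events_treated, n_treated):
    """Compare relative and absolute risk reduction for two groups."""
    risk_control = events_control / n_control
    risk_treated = events_treated / n_treated
    absolute_reduction = risk_control - risk_treated
    relative_reduction = absolute_reduction / risk_control
    # Number needed to treat: people treated to prevent one event
    nnt = 1 / absolute_reduction if absolute_reduction > 0 else float("inf")
    return {
        "risk_control": risk_control,
        "risk_treated": risk_treated,
        "relative_reduction": relative_reduction,
        "absolute_reduction": absolute_reduction,
        "number_needed_to_treat": nnt,
    }

# Invented example: 20 events per 10,000 without the product, 10 per 10,000 with it
result = risk_summary(20, 10_000, 10, 10_000)
print(f"Relative risk reduction: {result['relative_reduction']:.0%}")
print(f"Absolute risk reduction: {result['absolute_reduction']:.2%}")
print(f"Number needed to treat:  {result['number_needed_to_treat']:.0f}")
```

Here a 50% relative reduction corresponds to an absolute drop of 0.1 percentage points, or roughly 1,000 people treated to prevent one event.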

6. Write-up

  • Original claim (with source)
  • Your analysis (with evidence)
  • Conclusion: True? Exaggerated? False?
  • Corrected interpretation

Examples to consider:

  • Diet / supplement claims
  • Political polls (methodology and interpretation)
  • Vaccine/health scares
  • Financial advice ("this strategy beats the market")
  • Product effectiveness claims

Project 4: Build a Statistical Report

Goal: Analyze a business problem using statistics and present actionable insights.

Sample scenarios:

Scenario A: Customer Retention

  • Problem: Customer churn increasing
  • Data: Customer demographics, usage patterns, churn status
  • Analysis: What factors predict churn? (logistic regression)
  • Recommendation: Target interventions for high-risk customers

Scenario B: Pricing Optimization

  • Problem: What price maximizes revenue?
  • Data: Historical sales at different prices
  • Analysis: Price elasticity, demand curves, confidence intervals
  • Recommendation: Optimal price range

Scenario C: Quality Control

  • Problem: Defect rate seems high
  • Data: Defect counts over time
  • Analysis: Control charts, hypothesis tests vs target rate
  • Recommendation: Process changes needed?
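For Scenario C, a control chart for defect proportions (a p-chart) can be sketched in a few lines. This assumes the conventional 3-sigma limits around the overall defect rate; the batch data and function name are hypothetical.

```python
import math

def p_chart(defects, sample_sizes):
    """Compute 3-sigma p-chart limits and flag out-of-control batches."""
    p_bar = sum(defects) / sum(sample_sizes)  # overall defect rate (center line)
    rows = []
    for d, n in zip(defects, sample_sizes):
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        lower = max(0.0, p_bar - 3 * sigma)
        upper = min(1.0, p_bar + 3 * sigma)
        rate = d / n
        rows.append({"rate": rate, "lower": lower, "upper": upper,
                     "out_of_control": not (lower <= rate <= upper)})
    return p_bar, rows

# Hypothetical weekly batches of 500 units each
defects = [8, 10, 9, 11, 25, 10]
sizes = [500] * 6
center, rows = p_chart(defects, sizes)
print(f"Center line: {center:.3%}")
for week, row in enumerate(rows, start=1):
    status = "OUT OF CONTROL" if row["out_of_control"] else "ok"
    print(f"week {week}: {row['rate']:.1%} "
          f"(limits {row['lower']:.1%} to {row['upper']:.1%}) {status}")
```

In this made-up data only week 5 falls outside the limits, which would point to a specific event worth investigating rather than a general process drift.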

Required components:

1. Executive Summary

  • Problem statement
  • Key findings (2-3 bullet points)
  • Recommendation (actionable)

2. Data & Methods

  • Data sources and sample size
  • Variables analyzed
  • Statistical methods used
  • Assumptions and limitations

3. Results

  • Tables and visualizations
  • Statistical tests with interpretation
  • Confidence intervals
  • Sensitivity analysis

4. Discussion

  • Practical significance
  • Limitations and caveats
  • Risks and uncertainty
  • Implementation considerations

5. Recommendations

  • Clear, actionable next steps
  • Expected impact (with uncertainty ranges)
  • Monitoring plan

Audience: Write for non-statistical stakeholders. Avoid jargon and focus on business impact.

Project 5: Statistical Simulation

Goal: Use simulation to understand a statistical concept or solve a problem.

Option A: Central Limit Theorem

  • Start with a non-normal distribution (exponential, uniform, etc.)
  • Sample from it repeatedly, calculate means
  • Plot distribution of means
  • Show how the distribution of sample means becomes approximately normal as n increases
  • Demonstrate E[X̄] = μ and SE = σ/√n
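A minimal sketch of Option A, assuming an exponential distribution with rate 1 (so μ = 1 and σ = 1) as the non-normal starting point. It checks the numerical claims only; plotting histograms of `means` with Matplotlib would complete the picture. The function name and parameter defaults are illustrative.

```python
import math
import random
import statistics

def sample_means(n, n_samples=5000, rate=1.0, seed=0):
    """Draw n_samples samples of size n from an exponential and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(rate) for _ in range(n)])
        for _ in range(n_samples)
    ]

# For an exponential with rate 1: mu = 1 and sigma = 1
for n in (2, 10, 50):
    means = sample_means(n)
    observed_se = statistics.stdev(means)
    expected_se = 1 / math.sqrt(n)
    print(f"n={n:>2}: mean of means = {statistics.fmean(means):.3f}, "
          f"SD of means = {observed_se:.3f}, sigma/sqrt(n) = {expected_se:.3f}")
```

The mean of the sample means should stay near 1 for every n, while their spread should shrink in line with σ/√n.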

Option B: Bootstrap Confidence Intervals

  • Take a sample of data
  • Bootstrap resample 10,000 times
  • Calculate statistic (mean, median, correlation) for each
  • Construct 95% CI using percentile method
  • Compare to theoretical CI
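Option B can be sketched as follows, using the percentile method described above. The wait-time data is invented, and the "theoretical" comparison here is a normal-approximation interval for the mean (a t-based interval would be more appropriate for a sample this small, but needs a library beyond the standard one).

```python
import math
import random
import statistics
from statistics import NormalDist

def bootstrap_ci(data, stat=statistics.fmean, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, same size as the original sample
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    tail = (1 - confidence) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper

def normal_ci_for_mean(data, confidence=0.95):
    """Normal-approximation interval for the mean, for comparison."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = statistics.stdev(data) / math.sqrt(len(data))
    center = statistics.fmean(data)
    return center - z * se, center + z * se

# Hypothetical sample of wait times in minutes
wait_times = [2.1, 3.4, 1.8, 5.6, 2.9, 4.2, 3.1, 7.5, 2.4, 3.8, 4.9, 2.7]
print("bootstrap CI (mean):  ", bootstrap_ci(wait_times))
print("bootstrap CI (median):", bootstrap_ci(wait_times, stat=statistics.median))
print("normal CI (mean):     ", normal_ci_for_mean(wait_times))
```

The appeal of the bootstrap is visible in the second call: swapping in the median needs no new formula, only a different `stat` function.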

Option C: Power Analysis

  • Simulate studies with varying sample sizes
  • Generate data under H₁ (effect exists)
  • Test H₀ (no effect)
  • Calculate proportion of times p < 0.05 (power)
  • Show how power increases with n and effect size
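One way to sketch Option C is to simulate a simple z-test many times and count rejections. The setup below is an assumption chosen for simplicity: data are normal with known standard deviation 1, H₀ says the mean is 0, and `effect` is the true mean under H₁.

```python
import math
import random
from statistics import NormalDist, fmean

def simulated_power(n, effect, alpha=0.05, n_sims=2000, seed=0):
    """Estimate power of a two-sided z-test (known sigma = 1) by simulation."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Generate data under H1: true mean equals `effect`
        sample = [rng.gauss(effect, 1.0) for _ in range(n)]
        z = fmean(sample) * math.sqrt(n)  # test statistic for H0: mean = 0
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims

for effect in (0.2, 0.5):
    for n in (10, 20, 50, 100):
        print(f"effect={effect}, n={n:>3}: power = {simulated_power(n, effect):.2f}")
```

Setting `effect=0.0` turns the same function into a check of the Type I error rate, which should come out close to α.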

Option D: P-Hacking Demonstration

  • Simulate 20 studies testing a null effect
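  • Show that, on average, about 1 of the 20 will have p < 0.05 by chance alone (20 × 0.05 = 1), and that the chance of at least one false positive is about 64% (1 - 0.95^20) when the studies are independent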
  • Demonstrate dangers of selective reporting
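Option D can be sketched by running many batches of 20 null studies and counting how often something "significant" turns up. The study design here (a z-test on 30 standard normal observations) is an arbitrary assumption; any valid test on null data would behave the same way.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def null_study_p_value(rng, n=30):
    """p-value of a two-sided z-test on data with no true effect (mean 0, sd 1)."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) * math.sqrt(n)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def run_batches(n_batches=1000, n_studies=20, alpha=0.05, seed=0):
    """Repeat a batch of null studies and summarize the false positives."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_batches):
        p_values = [null_study_p_value(rng) for _ in range(n_studies)]
        counts.append(sum(p < alpha for p in p_values))
    avg_false_positives = fmean(counts)
    share_with_any = sum(c > 0 for c in counts) / n_batches
    return avg_false_positives, share_with_any

avg, share = run_batches()
print(f"Average 'significant' results per 20 null studies: {avg:.2f}")
print(f"Share of batches with at least one: {share:.1%}")
```

If only the significant study from each batch were written up, a reader would see a steady stream of "findings" about an effect that does not exist.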

Option E: Type I vs Type II Errors

  • Simulate data under H₀ and H₁
  • Vary significance level (α) and sample size
  • Calculate Type I error rate (false positives)
  • Calculate Type II error rate (false negatives)
  • Show the tradeoff: at a fixed sample size, lowering α reduces Type I errors but increases Type II errors
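A sketch of Option E under the same simplifying assumptions as a basic z-test: normal data with known standard deviation 1, H₀ mean 0, and a hypothetical true effect of 0.4 under H₁. The same random noise is reused for both scenarios so that the comparison across α values is not blurred by simulation noise.

```python
import math
import random
from statistics import NormalDist, fmean

def error_rates(alpha, n, effect=0.4, n_sims=2000, seed=0):
    """Estimate Type I and Type II error rates of a two-sided z-test by simulation."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    false_negatives = 0
    for _ in range(n_sims):
        noise_mean = fmean([rng.gauss(0.0, 1.0) for _ in range(n)])
        z_null = noise_mean * math.sqrt(n)            # data under H0 (mean 0)
        z_alt = (noise_mean + effect) * math.sqrt(n)  # same noise shifted to H1
        if abs(z_null) > z_crit:
            false_positives += 1   # rejected a true H0
        if abs(z_alt) <= z_crit:
            false_negatives += 1   # failed to reject a false H0
    return false_positives / n_sims, false_negatives / n_sims

for n in (30, 100):
    for alpha in (0.10, 0.05, 0.01):
        type1, type2 = error_rates(alpha, n)
        print(f"n={n:>3}, alpha={alpha:.2f}: Type I = {type1:.3f}, Type II = {type2:.3f}")
```

Reading down the output for a fixed n shows the tradeoff, while comparing n = 30 with n = 100 shows that a larger sample is the way to reduce Type II errors without loosening α.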

Tools: Python (NumPy, SciPy, Matplotlib), R, or even Excel with random number generation.

Deliverable: Annotated code + visualizations + explanation of what you learned.

Evaluation Rubric

What Makes a Great Capstone?
Statistical Rigor

  • Excellent: Appropriate methods, checks assumptions, acknowledges limitations
  • Poor: Wrong test, ignores violations, overconfident

Critical Thinking

  • Excellent: Questions assumptions, considers alternatives, separates correlation from causation
  • Poor: Takes data at face value, jumps to conclusions

Communication

  • Excellent: Clear explanations, effective visualizations, tells a story
  • Poor: Jargon-heavy, confusing graphs, no narrative

Practical Relevance

  • Excellent: Actionable insights, considers real-world constraints
  • Poor: Purely academic, impractical recommendations

Honesty

  • Excellent: Reports uncertainty, acknowledges what can't be concluded
  • Poor: Hides limitations, overstates certainty

Next Steps After This Course

You've built a solid foundation in statistics. Where to go from here?

Deepen your knowledge:

  • Bayesian statistics: Probability as degree of belief
  • Machine learning: Prediction and pattern recognition
  • Causal inference: Going beyond correlation
  • Time series forecasting: ARIMA, exponential smoothing
  • Survey design and sampling: How to collect data properly
  • Experimental design: Factorial designs, blocking, interactions

Practice continuously:

  • Analyze datasets regularly (Kaggle, personal projects)
  • Read research papers critically
  • Follow data science blogs and journals
  • Participate in competitions (Kaggle, DrivenData)

Learn tools:

  • Python: pandas, NumPy, SciPy, statsmodels, scikit-learn
  • R: tidyverse, ggplot2, statistical tests
  • SQL: Data extraction and manipulation
  • Visualization: Tableau, matplotlib, seaborn, ggplot2

Teach others:

  • Explain concepts to reinforce understanding
  • Write blog posts or tutorials
  • Help peers with statistics questions

Stay skeptical:

  • Question statistical claims
  • Look for fallacies in the wild
  • Demand evidence and good methodology
  • Update beliefs based on evidence

The journey continues. Statistics is a skill that deepens with practice and never stops being useful.

All models are wrong, but some are useful.
George Box

Congratulations on completing the statistics course!

You now possess a rare and valuable skill: the ability to think clearly about uncertainty, make evidence-based decisions, and spot statistical nonsense.

Use these skills wisely. The world needs more statistically literate people.

Test your knowledge

When analyzing a real dataset, what should you do FIRST?