Why I'm Learning Statistics, Again
I took statistics in university. I got a decent grade. Within two years I had forgotten almost all of it.
This is, I’ve since learned, entirely normal. Statistics taught as a series of named tests — t-test here, chi-squared there, ANOVA if you’re lucky — mostly doesn’t stick. You pass the exam and then you forget which test applies when, because you never had to actually choose.
Now I’m learning it again, properly this time, because I want to do data work. And the experience is different enough that I think it’s worth writing about.
The Difference Between Knowing and Doing
There’s a gap I keep noticing between people who “know statistics” — can name the distributions, recite the assumptions — and people who “do statistics” — know which tool to reach for when faced with a real messy dataset.
The gap is embarrassingly wide. And it’s not always the formally trained people who sit on the doing side.
What the doing requires is a sense of when your data meets the assumptions of a method, and what happens to your conclusions when it doesn’t. That’s not something you get from textbook examples. It comes from working with real data and being wrong a few times.
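Here is a minimal sketch of what that can look like in practice. The numbers are invented for illustration and are not real measurements. The "method" is just summarising a set of paces by their mean, which quietly assumes there are no gross errors in the data. One implausible reading is enough to break that assumption.

```python
from statistics import mean, median

# Hypothetical paces in min/km, invented for illustration only.
paces = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 5.2]
# The same runs plus one implausible reading, e.g. a GPS glitch.
paces_with_glitch = paces + [15.0]

mean_clean, median_clean = mean(paces), median(paces)
mean_glitch, median_glitch = mean(paces_with_glitch), median(paces_with_glitch)

# Compare how far each summary moves when the bad reading is included.
print(f"mean:   {mean_clean:.2f} -> {mean_glitch:.2f}")
print(f"median: {median_clean:.2f} -> {median_glitch:.2f}")
```

With these made-up values the mean shifts by more than a minute per kilometre, while the median does not move. Neither summary is "right" in general. The point is only that the conclusion depends on whether the assumption held.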
What I’m Actually Doing Differently
I’m working through The Elements of Statistical Learning. It’s dense and it doesn’t hold your hand, which is why it’s useful. But more than any book, what’s changed is that I’m applying things immediately.
I have a small dataset of my own running data — pace, heart rate, elevation, conditions. Unremarkable data, but it’s mine, and I care about whether my conclusions are right. When I learn a technique, I try to apply it to that data first.
This sounds obvious. I don’t think I would have done it five years ago. I would have read, taken notes, and moved on. The habit of asking “does this actually work on something real?” is, I think, what makes it stick.
What’s Hard
Probability as a foundation. I can compute probabilities mechanically. Understanding why Bayesian inference works, why we treat the likelihood the way we do, why the confidence interval means what it means and not what I thought it meant — that’s harder.
I think this is where formal training helps and self-teaching is slower. The notation is dense and the intuitions are subtle. I’m going slowly.
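A small simulation can make the confidence interval point concrete. The "95%" describes the procedure: if you repeated the sampling many times, about 95% of the intervals you built would contain the true mean. It is not the probability that any one particular interval contains it. The sketch below is a simplified setup, not a general recipe. It assumes normally distributed data with a known standard deviation, so a plain z-interval applies. The function name and all parameter values are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, fmean


def ci_coverage(true_mean=5.0, sigma=1.0, n=30, trials=2000, level=0.95, seed=0):
    """Return the fraction of simulated z-intervals that contain true_mean."""
    rng = random.Random(seed)
    # Two-sided critical value; roughly 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Sigma is treated as known, so every interval has the same half-width.
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = fmean(sample)
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / trials


coverage = ci_coverage()
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land near 0.95. Each individual interval either contains the true mean or it doesn't, and the 95% only shows up across the repetitions.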
What’s Surprisingly Easy
The software. Python, R, the ecosystem — it’s remarkably accessible. The hard part isn’t running a regression. The hard part is understanding what the output means and whether you’re asking the right question.
Engineers often get this backwards. We go looking for tools first. But the tool is the easy part.
Why It Matters
I want to be someone who can look at a business question, figure out what data might answer it, collect or access that data, analyse it correctly, and communicate the result in a way that changes someone’s decision.
That’s a long chain. Statistics is one link in it. But it’s the link I’ve been missing.