The Beginner’s Guide to Sample Size

April 3, 2015

A baseball season is the amalgamation of a lot of little events. Each pitch fits into a plate appearance which fits into an inning which fits into a game which fits into a series which fits into a season. That’s a lot of little data points flowing into an overall end result. We care a lot about which players will have good seasons and careers. It matters to us that we can distinguish between good players and bad players, but doing so requires that we understand which chunks of data are meaningful and which aren’t.

Enter sample size. You’ve heard this phrase plenty over the last few years when talking about baseball statistics and it’s usually a conversation ender rather than a conversation starter. Someone cites a stat and then another person says it doesn’t matter because the sample size is too small. What does that mean and how should we properly think about sample size in baseball?

Each little moment in baseball is essentially random. Not random in the sense that all outcomes are equally likely, but random in the sense that the most likely outcome doesn’t happen every time. If the best hitter in baseball faced the worst pitcher 100 times, he would very likely strike out a couple of times and hit into a double play or two. He wouldn’t always hit a home run even if it was Coors Field and the pitcher was throwing meatballs. Think about the home run derby. MLB players can’t simply hit home runs on demand even when the pitcher is trying to help.

When dealing with pitches flying 90+ miles per hour and split second movements, a whole bunch of randomness gets thrown into the pot. This means that any one plate appearance might have a funky result. You know this. One time Don Kelly took Yu Darvish deep.

So of course, we know that a single plate appearance isn’t a convincing amount of data. Even the least sabermetrically minded person agrees with that concept. That single plate appearance is a valid data point, but it’s not enough information to inform your opinion very fully. Instead, you need more and more data points until you have enough for them to “stabilize.” Remember that word because we’re going to define it in a very specific way in a moment.

Essentially, we want to make sure we have enough observations that the random noise gets cancelled out. Don Kelly hit a home run against Yu Darvish one time, but how many Kelly versus Darvish at bats do we need before we can accurately assess their abilities? It’s more than one for sure, but the actual number you need depends on the skill you’re trying to analyze.

Some skills are more “stable” than others. For example, strikeout rate stabilizes in fewer than 100 PA while BABIP for a pitcher can take three years. The difference is the nature of the skill and the number of factors that influence the outcome of the play. With respect to strikeout rate, we’re only talking about the batter and pitcher’s ability to make or allow contact (or let strikes go by). When you’re talking about BABIP, you’re adding in quality of contact, direction, weather, defensive ability, luck, etc. That means there’s more room for noise, and statistics with less noise in the underlying data generating process stabilize more quickly.

So let’s go back to this idea of stabilization. Conceptually, it’s an ironclad idea. You want to know how many data points it takes for the current information to provide an accurate assessment of the player in question. There’s no one point at which something stabilizes. Things become stable over time at a given speed. So after five PA, you know more about a hitter’s walk rate than after one PA, but you don’t know as much as you do after 150 PA or 600 PA. A statistic doesn’t stabilize, it becomes more stable.
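One way to see “more stable, never fully stable” is to look at how far an observed rate typically lands from the true rate as PA pile up. The sketch below is a simplification, not anyone’s official method: it assumes every PA is an independent trial with a fixed probability (a binomial model), and the 8% walk rate is a made-up number for illustration.

```python
import math


def standard_error(true_rate: float, plate_appearances: int) -> float:
    """Typical gap between observed and true rate, assuming independent PA."""
    return math.sqrt(true_rate * (1 - true_rate) / plate_appearances)


# Hypothetical hitter whose true walk rate is 8% (an assumed value).
for pa in (1, 5, 150, 600):
    se = standard_error(0.08, pa)
    print(f"{pa:>3} PA: observed walk rate typically misses by about {se:.3f}")
```

The miss shrinks with the square root of the sample, so it takes four times as many PA to cut the noise in half. There is no cliff where the number suddenly becomes trustworthy.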

In baseball, we lean on some work by Russell Carleton (aka Pizza Cutter), who looked to see how many PA you need for a given statistic to reach the point where the correlation between that sample and another sample of the same size is 0.7 (i.e. R^2 of .49). That’s the colloquial definition of stabilize. So the rates you see on this page reflect that.
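To make the correlation idea concrete, here is a rough simulation of the general approach, not a reproduction of Carleton’s actual study. It invents a league of players, gives each a true-talent rate, draws two separate samples of the same size for every player, and correlates the two. Everything in it is an assumption chosen for illustration: the function names, the talent means and spreads, the 300 simulated players, and the treatment of each PA as an independent coin flip.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_sample_correlation(pa, talent_mean, talent_sd, n_players=300, seed=0):
    """Correlate two independent samples of `pa` PA each across simulated players."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_players):
        # Hypothetical true-talent rate, kept inside (0, 1).
        talent = min(max(rng.gauss(talent_mean, talent_sd), 0.01), 0.99)
        for sample in (first, second):
            successes = sum(rng.random() < talent for _ in range(pa))
            sample.append(successes / pa)
    return pearson(first, second)


if __name__ == "__main__":
    for pa in (25, 50, 100, 200, 400):
        # Wide spread in talent (strikeout-like) vs. narrow spread (BABIP-like).
        wide = split_sample_correlation(pa, talent_mean=0.20, talent_sd=0.07)
        narrow = split_sample_correlation(pa, talent_mean=0.30, talent_sd=0.01)
        print(f"{pa:>3} PA  wide-talent r = {wide:.2f}  narrow-talent r = {narrow:.2f}")
```

In a setup like this, the correlation climbs as PA increase, and it climbs much faster when players truly differ a lot from one another relative to the noise in a single sample. Where the 0.7 line gets crossed depends entirely on the assumed talent spread, so the specific PA values printed here should not be read as real stabilization points.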

The key is that 100 PA is better than 50 PA no matter the statistic, but 50 PA is more useful for plate discipline stuff than it is for batted ball stuff. The rates are different, but it’s always better to have more data.

For practical purposes, you really want to know the difference between a sample that’s meaningful and one that isn’t. There isn’t a point at which it becomes useful data all of a sudden, but there are quantities that are clearly one or the other. This is going to be important when the season starts next week.

Every April, at least one previously bad hitter has an awesome month. They have a .380 wOBA over three weeks and lots of people rush to suggest they are a breakout candidate who did something during the offseason to improve. It’s important to note that this may be true or it may not be true. All we know is that they hit .380 wOBA over three weeks, let’s call it 85 PA.

Are those 85 PA enough to lead us to totally change our opinion about this bad hitter to the point where we now think they are fundamentally different in the box? Using our sample size rules of thumb, the answer is no. A bad hitter can easily have a .380 wOBA over 85 PA without actually being a different hitter, just due to random chance. A couple lucky bounces and a well timed cluster of hits and his numbers look great even if he’s no different than he was before.

Those 85 PA give you some idea that he might be improving, but they are not sufficient to change your mind completely. A true .380 wOBA hitter should hit .380 over more stretches than a .310 wOBA hitter, but a .310 wOBA hitter can hit .380 over a month no problem.

Think of it this way. A true talent .300 hitter might go 3-for-10 over a stretch or they might go 6-for-10. That wouldn’t be strange at all and you wouldn’t change your mind about a hitter over 10 PA. The same is true for 50 or 100 with most stats. It seems meaningful early in the year when you don’t have other fresh data, but it’s not.
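The ten at-bat example can be checked with a little arithmetic. This sketch assumes each at-bat is an independent trial with a fixed .300 chance of a hit, which is a simplification (real at-bats depend on the pitcher, park, and so on). The function names are just for illustration.

```python
from math import comb


def prob_exact(hits: int, at_bats: int, true_avg: float) -> float:
    """Probability of exactly `hits` hits in `at_bats` independent at-bats."""
    return comb(at_bats, hits) * true_avg**hits * (1 - true_avg) ** (at_bats - hits)


def prob_at_least(hits: int, at_bats: int, true_avg: float) -> float:
    """Probability of `hits` or more hits in `at_bats` independent at-bats."""
    return sum(prob_exact(k, at_bats, true_avg) for k in range(hits, at_bats + 1))


print(f"exactly 3-for-10:   {prob_exact(3, 10, 0.300):.3f}")
print(f"6-for-10 or better: {prob_at_least(6, 10, 0.300):.3f}")
```

Under these assumptions, a true .300 hitter gets six or more hits in roughly one out of every twenty ten at-bat stretches. A season contains a lot of ten at-bat stretches, so a run like that will show up plenty of times without the hitter changing at all.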

This isn’t to say that streaky hitters don’t exist or that the “hot hand” is a fallacy. That’s a separate issue. This is an argument, backed by extensive data, that a collection of 40 PA is not more meaningful than the 500 that came before.
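A quick bit of arithmetic shows how little 40 PA can move the overall picture. The sketch below simply pools two samples and weights every PA equally. That is an assumption for illustration (it ignores any real change in talent), and the .450 and .310 on-base rates are invented numbers.

```python
def pooled_rate(samples):
    """Combine (rate, plate_appearances) pairs into one PA-weighted rate."""
    total_successes = sum(rate * pa for rate, pa in samples)
    total_pa = sum(pa for _, pa in samples)
    return total_successes / total_pa


# Hypothetical hitter: .310 OBP over the previous 500 PA, .450 over a hot 40 PA.
previous = (0.310, 500)
hot_start = (0.450, 40)
print(f"pooled OBP: {pooled_rate([previous, hot_start]):.3f}")
```

With these made-up inputs, the scorching 40 PA pull the combined rate up by only about ten points, from .310 to roughly .320. The 500 PA still carry most of the weight.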

This is tricky to internalize because when a player has success, you want to find a reason other than randomness because randomness is not an easy thing for the human mind to handle. But in many cases, it’s the right answer. Each player’s set of outcomes is drawn from a probability distribution around their true talent level. Sometimes those talent levels change, but you need a lot of evidence to believe that is the case. It’s very possible that a good hitter has a bad set of results over a short stretch just based on random stuff happening.

So try not to make too much of the April results. You can definitely look at the underlying performance, but don’t make too much of the end product. A player might be hitting the ball harder this April and that’s a sign of a new swing, but just the fact that they have a few more hits than normal doesn’t mean that’s going to continue.
