The Beginner’s Guide to Using Statistics Properly
by Neil Weinberg
September 15, 2014

We’ve spilled a great deal of virtual ink and audible podcasting words on the nature of Wins Above Replacement (WAR) and defensive metrics recently. Jeff Passan of Yahoo! Sports and many who responded to his critique of the current WAR calculation dug into the relative merits of the metric itself and how well we’ve estimated it to date. That’s a great conversation to have, and Dave has done the heavy lifting on behalf of FanGraphs in that regard. I’d like to pivot and discuss a very important point about the use of statistics in baseball: everything has flaws.

Every single statistic is wrong. Your eyes are wrong. It is all wrong. Nothing we have will provide you with perfect information, or even truly accurate information, about the underlying variables you care about. You don’t get to choose between flawed and not-flawed statistics; you get to choose between useful and not-useful statistics. More importantly, statistics become useful based on your awareness of the proper way to wield them.

Retrospective or Prospective Questions

The order in which you move through this thought process is up to you, but let’s start with the nature of the question itself. You have to decide whether you care about determining how a player performed in the past or how good he’s going to be today, tomorrow, or five years from now. Those are different questions, but we often treat them as if they are the same thing.

Want to know who the best hitter in baseball was over the last year? Sort by wRC+ for that time period and you have your answer, right? Not exactly, but we’ll get there in a second. For now, let’s assume that wRC+ over the last year is the true reflection of who was the best hitter. But if you want to know who the best hitter in the game is, sorting by current wRC+ won’t do the trick.

There’s no perfect way to answer the prospective question here. You want to use some type of projection that estimates the player’s current talent level based on his performance over multiple years, weighted by how recent it is. That’s a very simple way to define a projection (there’s a rough sketch of the idea below). But it also gets more complicated, because forecasting future performance comes with uncertainty. This means that even if we have a good projection system, we’re going to be uncertain about the precise talent level of each player. Is Trout a true-talent 160 wRC+ hitter? 170? 150? We’re not sure. We’re estimating based on an unknown data-generating process.

We have many statistics that capture what happened in the past, and we often use those statistics to inform what we think will happen in the future. In the MVP debate, we care only about the season in question, so using statistics that describe that season is great. If we want to decide who we should sign in free agency or target on the trade market, we want to incorporate additional information and attempt to estimate how well players will perform going forward. Those are two different questions and they require two different strategies, but we seem to appreciate that predicting the future includes uncertainty.
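To make that recency-weighted idea concrete, here is a minimal sketch in Python. The 5/4/3 season weights and the 300 plate appearances of regression toward league average are illustrative assumptions, loosely in the spirit of simple systems like Tom Tango’s Marcel, not the actual method behind any published projection system.

```python
# A minimal, illustrative projection: weight recent seasons more heavily,
# then regress toward league average to account for limited samples.
# The weights and regression amount are assumptions for illustration only.

def project_wrc_plus(seasons, league_avg=100.0, regression_pa=300):
    """seasons: list of (plate_appearances, wrc_plus) tuples, most recent first."""
    weights = [5, 4, 3]  # most recent season counts the most (assumed weights)

    weighted_pa = 0.0
    weighted_sum = 0.0
    for (pa, wrc), weight in zip(seasons, weights):
        weighted_pa += weight * pa
        weighted_sum += weight * pa * wrc

    # Regress toward the league mean: pretend we also observed
    # `regression_pa` plate appearances of exactly league-average hitting.
    weighted_sum += regression_pa * league_avg
    weighted_pa += regression_pa

    return weighted_sum / weighted_pa


# Hypothetical three-year line of (PA, wRC+), most recent season first.
print(round(project_wrc_plus([(650, 170), (640, 165), (620, 155)]), 1))
```

Even a toy scheme like this bakes in the two ideas above: recent seasons count for more, and limited samples get pulled toward the league average, which is why the output is an estimate rather than a statement of fact.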
Is that what really happened?

A moment ago, I wrote that a player’s past wRC+ isn’t an exact representation of how well that player performed. On the face of it, my claim sounds strange. The batter accumulated those hits and walks and outs, didn’t they? Of course they did. But that actually neglects the question at hand. The question we’re asking is “who was the best hitter in baseball?” In reality, wRC+ only tells you who had the highest wRC+. That particular statistic is the best estimate we have of offensive performance right now, but it isn’t a measure of truth. A scorching line drive is a single and a dribbler that dies in the grass is also a single. Those aren’t the same thing in anyone’s eyes except those of the official scorer.

In reality, a decent share of the outcomes we observe are conditional on factors outside the control of the players to whom we attribute those outcomes. We do our best to control for those factors, but we miss plenty. Our park factors could be more nuanced. We could control for quality of competition. We could measure performance based on exit velocity and trajectory rather than singles, doubles, and fly outs. We don’t do those things for a variety of reasons; some of them are impossible with the available data and some are really hard to get right. We don’t know if the player with the highest wRC+ was actually the best hitter, but it’s the best we can do right now.

People often complain about the uncertainty and flaws of defensive metrics, but offensive stats have many of the same flaws. You just don’t notice them as much because there’s more data to wash away some of the concerns. BABIP is a great example of this. Hitters can influence their BABIP more than pitchers can, but there is still a lot of noise amid that signal. If you get ten seeing-eye singles, you probably didn’t hit the ball as well as a guy who hit ten rockets to the left fielder on one hop. We know this intuitively, but we don’t conceptualize it the same way. There is lots of uncertainty in all of our statistics. We’re just used to the offensive uncertainty, and we mentally regress that performance much more easily than we do with the newer defensive numbers.

Proper Usage

The key to this entire endeavor is having a clear sense of the question you want to ask and the best way to go about answering it. Think about it this way: do you actually care who leads the league in batting average? Really think about it. You don’t. You may care which hitter is the best at getting on base or providing offensive value, but you don’t actually care about hits divided by at bats. Batting average is supposed to be a tool that tells you how well a player has performed as a hitter, just like wOBA or wRC+ is supposed to be a tool for the same thing. Leading the league in any of those doesn’t make you the best hitter or the hitter who had the best season; it makes you the person who led the league in a category.

You have questions about the game, and we have tools that go about answering those questions to the best of our abilities. You can’t get caught up in the raw output of the stats, because they don’t tell you anything if you don’t know how to interpret them. Think about WAR. WAR is imperfectly calculated because calculating it perfectly is impossible. We care about trying to uncover who is the all-around most valuable player, and WAR is a tool that allows us to work toward an answer. WAR gives us approximations of player value that we can use to separate groups of players. Typically, 6 WAR or higher gets you into the MVP conversation. WAR doesn’t tell you who the MVP was; WAR helps you filter out players who definitely aren’t in the MVP conversation. Just like wOBA helps you determine roughly how good a hitter has been. A .400 wOBA and a .405 wOBA aren’t different enough to tell you anything, and they especially aren’t different enough to tell you anything over the span of 40 plate appearances.
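One rough way to see why gaps that small are meaningless is to treat a rate stat as a series of independent success-or-failure trials and look at the binomial standard error for a given sample size. Real wOBA is not a simple binomial and the rates below are made-up examples, so treat this as a back-of-the-envelope sketch of the scale of the noise, not a real error model.

```python
# Rough illustration of how much random noise sits in a rate stat at a
# given sample size, treating each trial as independent success/failure.
from math import sqrt

def binomial_standard_error(rate, n):
    """Standard error of an observed rate over n independent trials."""
    return sqrt(rate * (1 - rate) / n)

for label, rate, n in [
    ("a ~.400 rate over 40 PA", 0.400, 40),
    ("a ~.400 rate over 600 PA", 0.400, 600),
    ("a .300 average over 600 AB", 0.300, 600),
]:
    print(f"{label}: +/- {binomial_standard_error(rate, n):.3f} (one standard error)")
```

By that crude yardstick, five points of wOBA over 40 plate appearances sit inside roughly 75 points of one-standard-error noise, and even over a full season a couple of points of batting average are far smaller than the random variation around them.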
To that end, you need to know the limitations of every statistic you use. We hear a lot about WAR’s flaws. You know what other stats are flawed? Literally all of them. Every single one. You know what else is flawed? The eye test. That’s true whether you’re a scout or a casual fan (although the scouts are usually better!), and it’s true whether you watch fifteen games or 150.

In statistics, there’s a pretty common axiom: all models are wrong, but some models are useful. A toy airplane doesn’t do you any good if you want to learn about the way a jet burns fuel, but it’s very useful if you want to get a sense of the size of the wings relative to the propeller. The same is true of baseball stats. WAR is great if you want to get a sense of a player’s overall contribution, but it can’t tell you anything about the quality of competition that player faced (yet, at least), for example. On-base percentage is great at telling you the frequency with which a player reached base, but it doesn’t tell you anything about extra-base power. And no retrospective statistic can tell you what’s going to happen in the future, either. We’ve been working toward building better and better models, but we’re not anywhere close to the truth.

The flaws in WAR aren’t reasons not to use WAR. They are reasons to use WAR only for what WAR can tell you. The same is true for batting average and wOBA and strikeout percentage. Everything we have to evaluate baseball is flawed in some way. The only way around this is to understand the flaws and properly account for them by building measures of uncertainty into everything you do. You’ve known forever that a .301 average and a .299 average are the same thing. You can’t tell the hitters apart, and you certainly couldn’t tell me which one is a better hitter from only those two pieces of data. This carries over into everything else.

Defensive metrics aren’t perfect yet. The way we incorporate defense into WAR likely isn’t perfect, either. But it’s the best we’ve been able to do so far, and the absence of perfection does not mean the absence of utility. A stat with flaws is better than nothing at all, as long as you are aware that the flaws exist. We’re working toward better measures, but nothing is perfect and everything requires caution. This isn’t just about WAR. It’s about everything we do.

Questions, thoughts, comments? Comment below!