Measuring Pitching Value is Complicated
You’re likely aware that there are different versions of Wins Above Replacement (WAR) housed here, at Baseball-Reference, and at Baseball Prospectus (called WARP). For a lot of people, this makes the statistic confusing because it seems like there shouldn’t be multiple ways to calculate something with the same name. To the credit of the critics, somewhere along the way we should have agreed on a way to make it easier to communicate which statistic is which that’s a little more clear than fWAR, rWAR, and WARP, but that’s not the focus of the discussion today.
When it comes to WAR for position players, the differences among the models are less philosophical and more technical. The sites use different defensive components, different base running stats, and a few other differences in the same vein, but the overall approach is pretty much equivalent. The inputs are different, but the different WARs agree on what should be measured. When it comes to pitching, it gets more complicated because what should be measured becomes the debate itself. This article doesn’t intend to tell you which WAR is best, but rather to walk through the decisions that one needs to make when evaluating a pitcher’s value.
It always starts with a question, and in this instance, we want to know how valuable a pitcher is to his team overall. Leaving aside pitchers who bat in the NL, the goal of pitching is to prevent runs. When your team is in the field, the goal is to allow the other team to score as few runs as possible, and the pitcher is the central figure in that pursuit.
So the more precise question is how well does a pitcher prevent runs, and then of course, how often does the pitcher do that?
This is a question that sounds easy but winds up being pretty complicated. It seems like you should be able to take the number of runs a pitcher allows and be done with it, but unfortunately pitchers aren’t the only ones involved in that part of the game, so we need some way of isolating the responsibility of the pitcher. In other words, we want to know how many runs the pitcher is responsible for and over how many batters faced or innings pitched.
And this is where the various WARs diverge quickly. The goal of each is to measure the role of the pitcher in run prevention, but they all take a very different approach to doing so. At FanGraphs, we use Fielding Independent Pitching (FIP) as the building block. FIP-based WAR is all about charging the pitcher for the parts of the game we know don’t involve his defense. So fWAR considers strikeouts, walks, hit batters, home runs, and infield flies. There’s more complexity than that, but the basic assumption is that stripping out defense works best when assuming equal results on balls in play.
No one would argue that pitchers have no control over what happens on balls in play, but fWAR treats balls in play that way because pitchers don’t have that much control over them and it’s more accurate to assume average responsibility than to try to carve it up with the available data. Again, everyone will tell you that this is not a perfect way to measure pitcher value, as it chooses to make an assumption about a part of the question we can’t answer very well right now.
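To make the FIP building block concrete, here is a minimal Python sketch of the commonly published FIP formula. It is an illustration, not the actual fWAR calculation. The function name `fip`, the default constant of 3.10 (the real constant is set each season so that league FIP matches league ERA), and the optional `iffb` argument that counts infield flies like strikeouts are all assumptions made for this example. Innings are assumed to be true decimals, so a third of an inning is 1/3 rather than the box-score ".1".

```python
def fip(hr, bb, hbp, k, ip, constant=3.10, iffb=0):
    """Fielding Independent Pitching from the commonly published formula.

    hr, bb, hbp, k: home runs, walks, hit batters, strikeouts
    ip:             innings pitched as a true decimal (assumption)
    constant:       league scaling constant; 3.10 is a placeholder
    iffb:           infield flies, counted like strikeouts (assumption)
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    # Only events that do not involve the fielders enter the numerator.
    numerator = 13 * hr + 3 * (bb + hbp) - 2 * (k + iffb)
    return numerator / ip + constant


# A hypothetical 200-inning season.
season_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=200.0)
print(round(season_fip, 3))  # 3.225
```

Notice that nothing about singles, doubles, or outs on balls in play appears in the inputs. That omission is the assumption discussed above: every pitcher is treated as getting average results on balls in play.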
At Baseball-Reference, they use runs allowed as the driving force and then they adjust that mark based on the quality of the team’s defense (among other things). In other words, rWAR takes the actual number of runs allowed and then pushes it in one direction or the other based on what we believe to be the quality of the defense played behind the pitcher. This, again, is not a perfect solution to the problem, as pitchers are affected by defense in different ways and the aggregate adjustment made at Baseball-Reference isn’t granular enough to account for that because we’re limited by the available data.
At Baseball Prospectus, their new DRA-based WARP is all about controlling for as much context as possible. They’re basing the measure on run expectancy allowed and then they use an advanced modeling strategy to attempt to control for all kinds of things like defense, catching, umpiring, weather, etc. The potential drawback here is that the model behind WARP may be overfit, and that it assumes its adjustments work equally well in all situations.
All of the WAR versions are useful models of pitching value, but as the old adage goes, all of the models are also wrong. Each model makes choices and assumptions that constrict the accuracy of the results, which is true any time you model anything.
But what you observe when looking at the three versions of WAR is that all three take a very different approach to addressing the exact same question. Based on how a pitcher actually pitched, how many runs should they have allowed? What is the pitcher’s specific responsibility? That’s the question at hand.
FanGraphs has a method to answer that question. So does Baseball-Reference. So does Baseball Prospectus. The issue is that we don’t actually know which method is best. The Baseball Prospectus model is brand new, so some of their technological advances are clear improvements, but their fundamental building blocks are still built on the shaky foundation that all pitching metrics are built upon.
We have to make a lot of choices when we measure pitching. What units matter? Do we judge by the pitch? By the plate appearance? By the inning? By the game? It may seem trivial, but you can strike out three batters in an inning while facing only those three hitters or you can do it while walking two guys in between. Per inning, that’s a great strikeout rate, but per PA, it’s a little worse. Is a pitcher who gets ahead 0-2 a lot but gives up some weak singles the same as a pitcher who gets behind 2-0 and gives up solid singles? Does it matter as much if a pitcher gives up four runs in a game that his team is winning by 10 as it does when he gives up four runs in a tie game?
Everyone believes in removing defense from the equation, but how can you possibly remove something so closely tied to run prevention? Should you just assume a pitcher can’t control what happens once a ball is put in play? Should you try to adjust their overall runs allowed based on defense? Should you try to control for everything that’s happening when a ball is put into play? How well can we do any of those things based on the data we have? Would launch angle and velocity help?
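The strikeout example above is easy to check with a little arithmetic. This is a small Python sketch with hypothetical helper names (`k_per_9` and `k_pct` are made up for the illustration) that compares the two innings under each denominator.

```python
def k_per_9(strikeouts, outs):
    """Strikeouts per nine innings, with innings derived from outs recorded."""
    innings = outs / 3
    return 9 * strikeouts / innings


def k_pct(strikeouts, batters_faced):
    """Strikeouts as a share of plate appearances."""
    return strikeouts / batters_faced


# Inning A: three batters, three strikeouts.
# Inning B: three strikeouts with two walks mixed in, so five batters.
clean = {"strikeouts": 3, "outs": 3, "batters_faced": 5 - 2}
messy = {"strikeouts": 3, "outs": 3, "batters_faced": 5}

for label, inning in (("clean", clean), ("messy", messy)):
    per_nine = k_per_9(inning["strikeouts"], inning["outs"])
    per_pa = k_pct(inning["strikeouts"], inning["batters_faced"])
    print(label, per_nine, per_pa)
# clean 27.0 1.0
# messy 27.0 0.6
```

Per inning the two outings are identical, while per plate appearance the second one drops from 100% to 60%. The choice of denominator decides whether those two walks show up in the strikeout rate at all.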
How do we adjust for opponent quality? How about park factors? Do the umpire and the catcher matter?
These are all important questions without clear answers. I won’t tell you which WAR is right because none of them are right. They’re the manifestation of different answers to the questions I just posed. They’re different in design so they’re different in the results they offer.
You want to be able to look at a pitcher’s body of work and make some estimate of how many runs he should have allowed relative to his peers in the same context or some generic context. There are a lot of ways to get to an answer.
Imagine a single start in which the pitcher allowed three runs in eight innings. That’s not enough information to tell you how well the pitcher pitched, but the tough question we are all asking is what additional information do we need to properly evaluate the start? We care about the opponent, the park, the defense, the catcher, the umpire, and we also care about how those three runs happened. Did the pitcher walk a lot of hitters? Was the contact hard? Did the ball take a lucky hop or a terrible bounce?
In a perfect world, you would somehow simulate thousands of games at every turn. It’s not just about the runs allowed, controlling for defense. That’s a blunt tool for a very nuanced job. FIP is blunt. DRA is less blunt, but it’s also averaging out a lot of effects that might operate in very complex ways. There’s a lot we don’t know about how to properly credit a pitcher for his performance.
Ultimately, the correct answer is to use each tool as it was intended to be used. Each WAR does a fairly nice job of approximating a pitcher’s value, and most of the time they’re close enough that you can be confident about the general tier in which a given pitcher has performed. Sometimes they disagree, and when they do, it could be because of randomness in the various components of the models or because you’re looking at a pitcher whose skills and value are complicated to understand.
We want to know which pitchers contribute most effectively to run prevention, but due to all of the things outside of a pitcher’s control that affect run prevention, we have to make some decisions about how to isolate the pitcher. At this point, no one has figured out the ideal way to do that. Until someone does, you should learn about the different options and choose the one that you think makes the best set of assumptions according to your definition of pitching value.
Neil Weinberg is the Site Educator at FanGraphs and can be found writing enthusiastically about the Detroit Tigers at New English D. Follow and interact with him on Twitter @NeilWeinberg44.
“its more accurate to assume average responsibility than to try to carve it up with the available data.”
is it? The very premise of DRA is that it’s more accurate to carve it up.
“That’s a blunt tool for a very nuanced job. FIP is blunt. DRA is less blunt, but it’s also averaging out a lot of effects that might operate in very complex ways.”
But FIP is averaging out a lot MORE of those effects. You’re always going to have to average out complex events, but the goal is to get more accurate. So the question is, “Is DRA’s averaging out more accurate than FIP’s averaging out?” BP has provided a lot of data that says yes. So I’m going to go with BP for now unless you can provide data that shows FIP’s averaging out is better.
The quotation you’re discussing is explaining the assumption of FIP, not making a declaration that such an assumption is correct or best. Just wanted to clarify that part.
So far I like the 50/50 approach to pitcher WAR very much.
I know it is not perfect, nor does it pretend to be, but I would guess it works as well as any other method so far. So how come it is not mentioned in this article? Just curious.