PitchingBot Pitch Modeling Primer
Introduction
The wealth of Statcast data that has become available to the public has led to a plethora of new ways to analyze players. We can now create statistics that are more indicative of a player’s approach and process rather than purely relying on outcomes.
Expected statistics such as xBA and xwOBA have entered mainstream discourse, with fans understanding that these can sometimes be more useful than the results on the field. A hitter may barrel the ball for a deep fly out, but that barrel indicates that better results are likely to come in the future. Similarly, a pitcher can throw a great pitch that gets hit for a home run, but that great pitch may be indicative of future success.
xBA takes a batted ball’s exit velocity and launch angle and uses a model to produce the probability of a hit. These pitch quality grades take this a step further back, using characteristics of individual pitches to produce an assessment of pitcher quality, with no reference to any outcomes after the ball is thrown.
A number of independently derived pitch quality metrics have recently cropped up in public analysis. I developed this model over the past couple of years under the name PitchingBot, and used to host these grades on my website.
Why Model Pitch Quality?
Before going into too much detail, I should explain why these statistics are worth producing. By removing all references to outcomes, these grades are much more stable than other pitching statistics, which makes it possible to form quicker judgments on player quality over smaller sample sizes. There is a wide range of features that an analyst must pay attention to when assessing pitcher quality, including velocity, movement, spin rate, release height and extension, spin/movement axis deviation, and location, among others. It can be tough to combine all of this information inside your head, but these models can weight everything appropriately and distill it into a single number representing overall quality.
There’s a reason that most, if not all, major league organizations have their own pitch quality models. These models’ outputs are driving pitch usage changes, and analytically-driven teams are some of the best at improving the pitch quality of players that they acquire. There are also applications beyond measuring the quality of major league pitchers. Pitch quality models can help to develop minor league players and measure their ability without needing to face big league hitters. They can also inform decisions on where pitches should be thrown in different scenarios to produce desired outcomes.
These models use a completely different data source than most statistics that measure pitcher ability. There’s no point in mixing SIERA and xFIP together to produce a new statistic because they use very similar information. Mixing pitch quality grades with existing ERA estimators may provide a projection that is better than either statistic could produce independently because the grades provide new information.
How is Pitch Quality Measured?
This section is a detailed overview of how the pitcher grades are produced — feel free to skip ahead if you don’t want to see how the sausage is made.
The core of the pitcher grading model is a set of smaller sub-models, each predicting the likelihood of an individual event. The flowchart below shows how these sub-models join together to produce a full set of predicted outcomes. The reason for using so many sub-models is that limiting the scope of each one lets it make much better predictions than a more general model that has to divide its attention among many tasks.
Pitches are categorized into fastballs, breaking balls, and offspeed pitches, each type with its own set of prediction models.
A benefit of splitting the models up in this way is that it makes the grades less of a “black box.” By digging into the predictions, it is possible to see why a player grades well: perhaps the models think he will get lots of swings and misses, or generate weak contact and groundballs.
The input variables used by the models are:
- Contextual variables
  - Pitcher handedness
  - Batter handedness & strike zone height
  - Count (balls & strikes)
- Stuff variables
  - Velocity
  - Spin rate
  - Horizontal and vertical movement
  - Release point and extension
  - Spin efficiency (estimated) and spin/movement axis deviation
  - Difference in velocity and movement relative to the pitcher’s primary fastball
- Location variables
  - Pitch height and horizontal position at the plate
The models are produced using XGBoost, a machine learning technique that builds a collection of decision trees. This isn’t the place for a fully detailed overview of how it works or of specific parameter choices. Overfitting can be an issue with this method, but careful training processes can avoid it. These models should be robust enough not to produce bizarre outcomes like the probabilities seen on last season’s Apple TV+ broadcasts.
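The article doesn’t give the training setup, so purely as an illustration, here is a generic sketch of fitting one sub-model (say, swing probability) with XGBoost, using synthetic stand-in data and early stopping as one common guard against overfitting:

```python
# Generic XGBoost sketch for one sub-model (e.g. swing probability).
# Synthetic data stands in for real Statcast pitches; all parameters
# here are illustrative, not PitchingBot's actual choices.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pitches = 10_000
X = rng.normal(size=(n_pitches, 18))             # stand-in context/stuff/location features
y = (rng.random(n_pitches) < 0.47).astype(int)   # stand-in swing / no-swing labels

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Early stopping on a held-out set helps avoid the overfitting noted above.
model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.02,
    max_depth=6,
    eval_metric="logloss",
    early_stopping_rounds=50,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

p_swing = model.predict_proba(X_val)[:, 1]  # P(swing) for each held-out pitch
```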
A Walkthrough of One Pitch
To understand how these predictions work to produce pitch values, here’s an example of a single pitch. On September 24, 2022, Jacob deGrom struck out Sean Murphy with a well-placed 2-2 slider. The models see the pitch location, speed, spin, movement, release point, etc. along with the context (a right-on-right 2-2 breaking ball), and produce the following predictions:
| Model Prediction | Probability |
| --- | --- |
| xSwing% | 82% |
| xWhiff% (assuming a swing) | 46% |
| xFoul% (assuming contact) | 44% |
| xCalled Strike% (assuming no swing) | 34% |
This gives the following likelihoods for different events:
| Event | Probability |
| --- | --- |
| Swinging Strike | 38% |
| Called Strike | 6% |
| Ball | 12% |
| Foul Ball | 19% |
| Ball in Play | 25% |
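To make the arithmetic explicit, here is a minimal Python sketch (variable names are mine) of how the conditional sub-model outputs chain together into these unconditional event probabilities:

```python
# Chain the conditional sub-model outputs for the deGrom slider into
# unconditional event probabilities; inputs come from the first table.
p_swing = 0.82    # xSwing%
p_whiff = 0.46    # xWhiff%, given a swing
p_foul = 0.44     # xFoul%, given contact
p_called = 0.34   # xCalled Strike%, given a take

p_take = 1 - p_swing
p_contact = p_swing * (1 - p_whiff)

events = {
    "Swinging Strike": p_swing * p_whiff,        # 0.82 * 0.46 ≈ 38%
    "Called Strike":   p_take * p_called,        # 0.18 * 0.34 ≈ 6%
    "Ball":            p_take * (1 - p_called),  # 0.18 * 0.66 ≈ 12%
    "Foul Ball":       p_contact * p_foul,       # 0.44 * 0.44 ≈ 19%
    "Ball in Play":    p_contact * (1 - p_foul), # 0.44 * 0.56 ≈ 25%
}

for event, p in events.items():
    print(f"{event}: {p:.0%}")
```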
For the ball in play outcomes, the model produces the following likelihoods of different types of ball in play:
| Batted Ball Type | EV < 90 mph | 90–95 mph | 95–100 mph | 100–105 mph | 105+ mph |
| --- | --- | --- | --- | --- | --- |
| Groundball | 47% | 6% | 5% | 3% | 2% |
| Line Drive | 14% | 3% | 3% | 2% | 1% |
| Flyball | 12% | 2% | 1% | <1% | <1% |
A ball in play here is unlikely to be hard-hit, with only a ~17% probability of leaving the bat at 95 mph or more, and a groundball is by far the most likely batted ball type.
Now that we know the range of possible outcomes, we can weight each outcome by its run value to get an Expected Run Value (xRV) for the pitch. Strikes are better than balls, and weak groundballs are better than hard-hit fly balls. xRV is normalized so that the average from any count is always zero. The table below shows how the different outcome probabilities contribute to the overall xRV:
| Event | Context-Neutral Run Value | Contribution to xRV (Run Value × Probability) |
| --- | --- | --- |
| Swinging Strike | -0.235 | -0.088 |
| Called Strike | -0.235 | -0.015 |
| Ball | 0.127 | 0.015 |
| Foul Ball | 0 | 0 |
| Ball in Play (sum of all types) | ~0 | ~0 |
| Total xRV | | -0.088 |
So this pitch was predicted to save the Mets 0.088 runs above the average pitch in that situation. These models are applied to every pitch thrown in major league games.
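As a worked check of the table above, here is a short sketch that reproduces the calculation (ball-in-play outcomes are collapsed to their roughly zero net value for this pitch; the full model sums over every batted ball cell):

```python
# Weight each event probability by its context-neutral run value to get xRV.
# Negative run values are good for the pitcher; the normalization that makes
# the per-count average zero happens upstream of this step.
run_values = {
    "Swinging Strike": -0.235,
    "Called Strike":   -0.235,
    "Ball":             0.127,
    "Foul Ball":        0.0,
    "Ball in Play":     0.0,   # net of all batted ball cells, ~0 for this pitch
}
event_probs = {
    "Swinging Strike": 0.38,
    "Called Strike":   0.06,
    "Ball":            0.12,
    "Foul Ball":       0.19,
    "Ball in Play":    0.25,
}

xrv = sum(p * run_values[event] for event, p in event_probs.items())
print(f"xRV: {xrv:.3f} runs")  # ≈ -0.088, i.e. 0.088 runs saved vs. average
```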
Grading on the 20-80 Scale
The previous section showed how to get the xRV for a single pitch, but how does that get turned into a statistic to measure pitcher quality? One option is to average all the xRVs and produce a metric on the runs scale. However, this could be tough to interpret, especially since “per pitch” is not a commonly used rate in baseball statistics.
The 20-80 scouting scale is one that many baseball fans are familiar with. If you aren’t, you can read a primer about it here.
Pitcher xRV values are converted to grades on the 20-80 scale, and grades for individual pitch types are shown on the same scale.
The average major league player is a 50 on the scale, with each step of 10 up or down representing one standard deviation in ability. Sixty is above average, 70 is excellent, and 80 represents one of the top players in the majors at that skill. The graph below shows that the distribution of pitcher grades follows a bell curve:
In addition to the pitcher grades, I have also provided PitchingBot ERA. This puts the expected run value onto the ERA scale, which is more familiar to most fans.
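The article doesn’t spell out the exact conversion, but the 50-centered, 10-per-standard-deviation definition suggests a simple z-score mapping. Here is a minimal sketch with toy numbers; the clipping bounds and sign convention are my assumptions:

```python
import statistics

def to_scouting_grade(xrv_per_pitch: float, league_mean: float, league_sd: float) -> float:
    """Map average xRV per pitch onto the 20-80 scale: 50 is league average
    and each 10 points is one standard deviation. Lower xRV is better for
    the pitcher, hence the sign flip; clipping to 20-80 is an assumption."""
    z = (league_mean - xrv_per_pitch) / league_sd
    return min(80.0, max(20.0, 50.0 + 10.0 * z))

# Toy population of per-pitch xRV averages, not real pitcher data.
pitcher_xrv = [-0.012, -0.003, 0.000, 0.004, 0.009]
mean, sd = statistics.mean(pitcher_xrv), statistics.stdev(pitcher_xrv)
print(to_scouting_grade(-0.012, mean, sd))  # the best pitcher in this toy sample
```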
Stuff & Location
The pitcher grades discussed so far come from models that use both stuff and location features. I have also produced grades that use only stuff variables or only location variables.
The stuff models include everything except where the ball actually goes: pitch velocity, spin, movement, release point, etc. For the location-only models, no stuff variables are included, only the generic pitch type (fastball, breaking, offspeed), the context (balls, strikes, handedness), and the location of the pitch.
For the stuff grades, only models that predict swing events are used (whiffs, foul balls, balls in play). Otherwise, the models would be trying to predict zone rate without any location cues.
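As a rough sketch of how the variants differ (feature names are illustrative, not the models’ actual columns):

```python
# Illustrative feature groupings for the three model variants; the names
# are stand-ins, not the actual column names used by PitchingBot.
CONTEXT = ["pitcher_hand", "batter_hand", "sz_height", "balls", "strikes"]
STUFF = [
    "velocity", "spin_rate", "horizontal_break", "vertical_break",
    "release_x", "release_z", "extension", "spin_efficiency",
    "axis_deviation", "velo_diff_vs_fastball", "movement_diff_vs_fastball",
]
LOCATION = ["plate_x", "plate_z"]

MODEL_FEATURES = {
    "overall": CONTEXT + STUFF + LOCATION,
    # Stuff-only: swing-event sub-models (whiff, foul, ball in play) only,
    # since take outcomes hinge on where the pitch crosses the plate.
    "stuff": CONTEXT + STUFF,
    # Location-only: a generic pitch group stands in for all stuff features.
    "location": CONTEXT + ["pitch_group"] + LOCATION,
}
```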
Stuff quality is much more stable than location quality within and between seasons, so stuff can be useful for analyzing small samples when other statistics aren’t appropriate. The graph below shows how these grades stabilize more quickly than other pitching statistics, especially the stuff grade. See this article for more details on how the stability of a statistic can be measured.
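For instance, one simple way to quantify stability (an illustration; the linked article describes the actual method) is a split-half correlation: grade two interleaved halves of each pitcher’s pitches and correlate the halves across pitchers.

```python
# Split-half stability sketch: correlate odd-half vs. even-half mean xRV
# across pitchers. Higher correlation = the metric stabilizes faster.
import numpy as np

def split_half_stability(xrv_by_pitcher: dict) -> float:
    odd = [np.mean(xrv[0::2]) for xrv in xrv_by_pitcher.values()]
    even = [np.mean(xrv[1::2]) for xrv in xrv_by_pitcher.values()]
    return float(np.corrcoef(odd, even)[0, 1])

# Toy data: 30 pitchers with a small true-talent spread and noisy pitches.
rng = np.random.default_rng(1)
talent = rng.normal(0.0, 0.01, 30)
sample = {f"pitcher_{i}": rng.normal(mu, 0.1, 500) for i, mu in enumerate(talent)}
print(split_half_stability(sample))
```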
Example of Usage
To see why these grades may be useful in the analysis of a pitcher, let’s look at Logan Webb. Webb started his major league career using a four-seam fastball as his primary pitch. In 2021, he switched to a sinker instead, with much better results. Here are the stuff grades for those two pitches from 2019 to 2022, along with Webb’s PitchingBot ERA and actual ERA.
| Season | Four-Seamer Stuff Grade | Four-Seamer Usage | Sinker Stuff Grade | Sinker Usage | PitchingBot ERA | ERA |
| --- | --- | --- | --- | --- | --- | --- |
| 2019 | 41 | 44% | 54 | 13% | 5.28 | 5.22 |
| 2020 | 38 | 34% | 65 | 15% | 5.14 | 5.47 |
| 2021 | 42 | 10% | 55 | 38% | 3.69 | 3.03 |
| 2022 | 34 | 3% | 51 | 33% | 3.28 | 2.90 |
The pitch grading model already knew that Webb’s sinker was much better than his four-seam fastball. When he changed his pitch mix, the grades registered a marked improvement in overall pitch quality, early evidence of his 2021 breakout.
Limitations
There are some caveats that come with the analysis of these grades because the models cannot account for everything that a pitcher does. Notable blind spots include the effects of command, deception, sequencing, and some arsenal effects.
Command is unaccounted for because the model doesn’t know where a pitcher was aiming. Pitchers with good command improve their catchers’ framing numbers and can exploit hitters’ weaknesses more effectively. Command artist Kyle Hendricks was a notable model over-performer from 2015 to 2020.
Similarly, some pitchers have deceptive deliveries or effective methods of using different pitches to keep hitters off balance. Ace starters such as Clayton Kershaw, Max Scherzer, and Corbin Burnes have reliably outperformed their expected pitch quality over a large sample.
All ball-in-play predictions are treated independently of spray angle. If a pitcher such as Framber Valdez is able to pitch to the benefit of the defensive positioning behind him, the models do not account for it.
And despite being more stable than conventional statistics, these grades can still be vulnerable to small sample size effects, which should be kept in mind when assessing a pitcher after only a few appearances.
Summary
Statcast data is ushering in a new era of model-based sabermetrics. FanGraphs will now present these pitch quality grades, a stable and results-independent measure of pitcher ability, alongside its traditional pitch values.
FanGraphs will also have access to all of the predictions underlying these grades, allowing writers to explore new analysis opportunities, which I’m excited to read in the future.
Cameron Grove worked as an independent baseball analyst before joining the Cleveland Guardians front office. His blog can be found here.