Sample Size

by Steve Slowinski
February 18, 2010

A baseball season is the amalgamation of a lot of little events. Each pitch fits into a plate appearance which fits into an inning which fits into a game which fits into a series which fits into a season. That’s a lot of little data points flowing into an overall end result. We care a lot about which players will have good seasons and careers. It matters to us that we can distinguish between good players and bad players, but doing so requires that we understand which chunks of data are meaningful and which aren’t.

Enter sample size. You’ve heard this phrase plenty over the years when talking about baseball statistics, and it’s usually a conversation ender rather than a conversation starter. Someone cites a stat and then another person says it doesn’t matter because the sample size is too small. What does that mean, and how should we properly think about sample size in baseball?

If you’re just looking for the numbers, skip ahead to the Rules of Thumb section.

Overview:

Each little moment in baseball is essentially random. Not random in the sense that all outcomes are equally likely and subject completely to chance, but random in the sense that the most likely outcome doesn’t happen every time. If the best hitter in baseball faced the worst pitcher 100 times, he would very likely strike out a couple of times and hit into a double play or two. He wouldn’t always hit a home run even if it was Coors Field and the pitcher was throwing meatballs. Think about the home run derby. MLB players can’t simply hit home runs on demand even when the pitcher is trying to help.

When dealing with pitches flying 90+ miles per hour and split-second movements, a whole bunch of randomness gets thrown into the pot. Any one plate appearance might have a funky result, so you need to see lots of events to get a clear picture of what is going on. You know this. One time Don Kelly took Yu Darvish deep.

So of course, we know that a single plate appearance isn’t a convincing amount of data. Even the least sabermetrically minded person agrees with that concept. That single plate appearance is a valid data point, but it’s not enough information to inform your opinion very fully. Instead, you need more and more data points until you have enough for them to “stabilize.” Remember that word, because we’re going to come back to it in a very specific way in a moment.

Essentially, we want to make sure we have enough observations that the random noise gets cancelled out. Don Kelly hit a home run against Yu Darvish one time, but how many Kelly versus Darvish at bats do we need before we can accurately assess their abilities? It’s more than one for sure, but the actual number you need depends on the skill you’re trying to analyze. For example, strikeout rate starts to communicate useful information in fewer than 100 PA, while BABIP for a pitcher can take three years. The difference is the nature of the skill and the number of factors that influence the outcome of the play. With respect to strikeout rate, we’re only talking about the batter’s and pitcher’s ability to make or allow contact (or let strikes go by). When you’re talking about BABIP, you’re adding in quality of contact, direction, weather, defensive ability, luck, etc. That means there’s more room for noise, and things with less noise in the actual data generating process stabilize more quickly. There are also diminishing returns. Having 20 PA is better than five PA, but having 520 PA is only a little better than having 505 PA.
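To put rough numbers on those diminishing returns, here is a minimal sketch that treats each PA as an independent coin flip (a simple binomial model, which real plate appearances only approximate) and computes the standard error of an observed rate. The 20% true rate is a made-up value for illustration, and `standard_error` is a hypothetical helper name.

```python
import math


def standard_error(rate: float, n: int) -> float:
    """Standard error of an observed rate over n independent trials (binomial model)."""
    return math.sqrt(rate * (1.0 - rate) / n)


# Hypothetical true rate, chosen only for illustration.
TRUE_RATE = 0.20

for n_pa in (5, 20, 505, 520):
    print(f"{n_pa:>4} PA: observed rate is typically within +/- {standard_error(TRUE_RATE, n_pa):.3f}")
```

Under this model, going from 5 PA to 20 PA cuts the noise in half, while going from 505 PA to 520 PA shrinks it by only about 1.5 percent.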

Stabilization? Reliability?

So let’s go back to this idea of stabilization. This is the word you hear a lot in conversation about baseball statistics. Conceptually, it’s an ironclad idea. You want to know how many data points it takes for the current information to provide an accurate assessment of the player in question. However, there is no one point at which something stabilizes. Things become stable over time, at a given speed. So after five PA, you know more about a hitter’s walk rate than after one PA, but you don’t know as much as you do after 150 PA or 600 PA. A statistic doesn’t stabilize, it becomes more stable.

The word “stabilize” got into the baseball lexicon after some work by Russell Carleton (aka Pizza Cutter), who looked to see how many PA you need for a given statistic to reach the point where the correlation between that sample and another sample of the same size is 0.7 (i.e. R^2 of .49). That’s the colloquial definition of stabilize and despite Carleton’s constant warnings, most of us picked up the word “stabilize” and ran with it even if it’s not the most useful term. He has done updated work here and here, and has some thoughts about how we talk about stabilization and reliability here.
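The idea of correlating one sample with another sample of the same size can be sketched with a toy simulation. This is not Carleton’s actual procedure or data: it invents a league of hitters whose true strikeout rates follow an assumed normal distribution (mean 20%, standard deviation 5 points, both hypothetical), gives each hitter two independent samples of the same length, and correlates the two observed rates. The function names are illustrative only.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_sample_correlation(n_pa, n_players=1000, seed=0):
    """Correlate observed K% in two independent n_pa-sized samples per simulated hitter."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_players):
        # Assumed talent distribution (hypothetical): true K% ~ Normal(0.20, 0.05), clipped.
        true_k = min(max(rng.gauss(0.20, 0.05), 0.02), 0.60)
        first.append(sum(rng.random() < true_k for _ in range(n_pa)) / n_pa)
        second.append(sum(rng.random() < true_k for _ in range(n_pa)) / n_pa)
    return pearson(first, second)


for n_pa in (20, 60, 200, 400):
    print(f"{n_pa:>3} PA per sample: r = {split_sample_correlation(n_pa):.2f}")
```

Because the talent spread is invented, the PA count at which this toy league reaches r = 0.7 will not match the published numbers. The point is only that the correlation climbs as the samples get larger.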

Regardless, the key is that 100 PA is better than 50 PA no matter the statistic, but smaller samples are more useful for some statistics than others. It’s always better to have more data, but the rate at which the data becomes useful varies based on the statistic. A good way to think about this is by visualizing a curve.

The curve comes from work by Sean Dolinar and Jonah Pemstein, which takes a methodology similar to Carleton’s but doesn’t focus on the .49 R^2 cut point (the measure is technically Cronbach’s alpha, for the statistically inclined). Instead, they plot the reliability measure for each 10 PA increment to better show the nature of stabilization. K% crosses the .49 threshold much more quickly than the other statistics, which is consistent with our common understanding, but this also shows that the first 200 PA are much more important than the second 200 PA for understanding K%, something you wouldn’t necessarily see in the Carleton studies. Here is a link to their updated work, which allows you to visualize many different statistics using this method. The same tool is available below.
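A rough way to see why the first 200 PA matter more than the second 200 is the Spearman-Brown formula from classical test theory, which projects reliability at one sample size from reliability at another. This is a swapped-in textbook approximation, not the Dolinar and Pemstein method. The sketch assumes the 60 PA cut point for hitter strikeout rate corresponds to a reliability of 0.7, following the definition given above, and `projected_reliability` is a hypothetical helper name.

```python
def projected_reliability(n, cut_point, r_at_cut=0.7):
    """Spearman-Brown projection of reliability at n trials, given reliability at cut_point trials."""
    m = n / cut_point  # how many times longer (or shorter) the sample is than the cut point
    return m * r_at_cut / (1.0 + (m - 1.0) * r_at_cut)


K_RATE_CUT_POINT = 60  # PA, the hitter strikeout rate entry from the cut-point list

after_200 = projected_reliability(200, K_RATE_CUT_POINT)
after_400 = projected_reliability(400, K_RATE_CUT_POINT)
print(f"first 200 PA:  reliability rises from 0.00 to {after_200:.2f}")
print(f"second 200 PA: reliability rises from {after_200:.2f} to {after_400:.2f}")
```

Under these assumptions the first 200 PA do almost all of the work, and the second 200 PA add only a few points of reliability.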

Practical Use:

For practical purposes, you really want to know the difference between a sample that’s meaningful and one that isn’t. There isn’t a point at which it becomes useful data all of a sudden, but there are quantities that are clearly one or the other. For example, 400 PA is enough to tell you a lot about strikeout rate and 30 PA is not enough to tell you much about HR rate. The real trick is how you update your beliefs between those two extremes.
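One common way to sketch that updating is to blend the observed rate with a league-average rate, weighting the observation by how reliable the sample is. The version below is an illustration under assumptions, not a method taken from the research cited here: it treats each cut point as the sample size where reliability is 0.7, converts that into a regression constant, and uses made-up observed and league strikeout rates. All function and variable names are hypothetical.

```python
def signal_weight(n, cut_point, r_at_cut=0.7):
    """Share of weight given to the observed sample, using weight = n / (n + k)."""
    # k is the number of trials at which the weight would be exactly 0.5,
    # derived from the assumption that weight equals r_at_cut at the cut point.
    k = cut_point * (1.0 - r_at_cut) / r_at_cut
    return n / (n + k)


def regressed_estimate(observed, league_average, n, cut_point):
    """Blend an observed rate with the league average according to sample reliability."""
    w = signal_weight(n, cut_point)
    return w * observed + (1.0 - w) * league_average


# The two extremes mentioned in the text.
print(f"K rate after 400 PA: weight on observed = {signal_weight(400, 60):.2f}")
print(f"HR rate after 30 PA: weight on observed = {signal_weight(30, 170):.2f}")

# Hypothetical hitter: 30% strikeout rate over 85 PA, against an assumed 22% league average.
estimate = regressed_estimate(0.30, 0.22, n=85, cut_point=60)
print(f"regressed K rate estimate: {estimate:.3f}")
```

The estimate lands between the observed rate and the league average, and it moves closer to the observed rate as the sample grows.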

Every April, at least one previously bad hitter has an awesome month. They have a .380 wOBA over three weeks and lots of people rush to suggest they are a breakout candidate who did something during the offseason to improve. It’s important to note that this may be true or it may not be true. All we know is that they hit .380 wOBA over three weeks, let’s call it 85 PA.

Are those 85 PA enough to lead us to totally change our opinion about this bad hitter to the point where we now think they are fundamentally different in the box? Using our sample size rules of thumb, the answer is no. A bad hitter can easily have a .380 wOBA over 85 PA without actually being a different hitter, just due to random chance. A couple lucky bounces and a well timed cluster of hits and his numbers look great even if he’s no different than he was before.

Those 85 PA give you some idea that he might be improving, but they are not sufficient to change your mind completely. A true .380 wOBA hitter should hit .380 over more stretches than a .310 wOBA hitter, but a .310 wOBA hitter can hit .380 over a month no problem.
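For a back-of-the-envelope check on that claim, the sketch below uses a normal approximation to estimate how often a true .310 wOBA hitter would post a .380 wOBA or better over 85 PA. The per-PA standard deviation of 0.5 is an assumed round number used only for illustration, not a figure from this article, and `prob_at_least` is a hypothetical helper name.

```python
import math


def prob_at_least(observed, true_woba, n_pa, per_pa_sd=0.5):
    """Normal-approximation probability that a hitter's wOBA over n_pa is at least `observed`."""
    standard_error = per_pa_sd / math.sqrt(n_pa)  # noise in the sample average
    z = (observed - true_woba) / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail area of the standard normal


p = prob_at_least(observed=0.380, true_woba=0.310, n_pa=85)
print(f"Chance a true .310 hitter shows .380 or better over 85 PA: {p:.1%}")
```

Under these assumptions the chance is on the order of one in ten, which is why a hot three weeks is suggestive but far from conclusive.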

The goal of these “stabilization” or “reliability” numbers is to prevent you from reacting to data that is still highly susceptible to random chance. Good hitters can have bad results in small samples even when their process is fine. The larger the sample gets, the more we can discount this random noise component and zero in on factors that are within the player’s control. That’s the point of sample size: preventing us from ascribing too much meaning to small chunks of data.


Rules of Thumb:

Here is a tool that allows you to explore those reliability graphs mentioned earlier. If it’s not loading on this page for you, here is a link you can use to find it.

If you’re only looking for Carleton’s .49 cut points, they are listed below.

“Stabilization” Points for Offense Statistics:

  • 60 PA: Strikeout rate
  • 120 PA: Walk rate
  • 240 PA: HBP rate
  • 290 PA: Single rate
  • 1610 PA: XBH rate
  • 170 PA: HR rate
  • 910 AB: AVG
  • 460 PA: OBP
  • 320 AB: SLG
  • 160 AB: ISO
  • 80 BIP: GB rate
  • 80 BIP: FB rate
  • 600 BIP: LD rate
  • 50 FB: HR per FB
  • 820 BIP: BABIP

“Stabilization” Points for Pitching Statistics:

  • 70 BF: Strikeout rate
  • 170 BF: Walk rate
  • 640 BF: HBP rate
  • 670 BF: Single rate
  • 1450 BF: XBH rate
  • 1320 BF: HR rate
  • 630 BF: AVG
  • 540 BF: OBP
  • 550 AB: SLG
  • 630 AB: ISO
  • 70 BIP: GB rate
  • 70 BIP: FB rate
  • 650 BIP: LD rate
  • 400 FB: HR per FB
  • 2000 BIP: BABIP

 

Links to Further Reading:

525,600 Minutes: How Do You Measure a Player in a Year? – Statistically Speaking / Pizza Cutter

On the Reliability of Pitching Stats – Statistically Speaking / Pizza Cutter

When Samples Become Reliable – FanGraphs

The Beginner’s Guide to Sample Size – FanGraphs

A New Way to Look at Sample Size – FanGraphs

A Long-Needed Update on Reliability – FanGraphs

Should I Worry About My Favorite Pitcher – Baseball Prospectus

It’s A Small Sample Size After All – Baseball Prospectus

Reliably Stable (You Keep Using That Word) – Baseball Prospectus

-Neil Weinberg, Updated August 2017






Steve is the editor-in-chief of DRaysBay and the keeper of the FanGraphs Library. You can follow him on Twitter at @steveslow.

theperfectgame (Member)

Steve,
If GB rate and GB/FB are reliable at 200 PA, doesn’t that imply that FB rate is also reliable at 200 PA? I only ask because you have it listed at 250 PA. Sorry if this was nitpicking. I think the whole library is fantastic. Thanks!

10 years ago

theperfectgame (Member)

In the same vein, isn’t BB/PA implicitly reliable at 500 BF? Thanks again.

10 years ago

Steve Slowinski (Author)

Good point….I’m just quoting word for word from the research that was done. I can double-check all that, but otherwise I’d go with what is listed. Sometimes individual stats can be reliable in one context, but once you start mixing them together, the results aren’t always the same.

Also, geez, the formatting on this page did not transfer well. Time to fix that.

10 years ago

James (Guest)

This came up in the original thread, here is Pizza Cutter’s answer:

“Those numbers are per PA, not per ball in play. So, for one player who always puts the ball in play LD + GB + FB may account for 95% of his plate appearances. For another guy who strikes out and walks a lot (we’ll call him “Adam Dunn” just to give him a name), LD + GB + FB might only cover 70% of his PA’s.”

9 years ago

Jake (Guest)

I’m confused about split-half methodology and how the results can be interpreted. When you say that measure X stabilizes at 200 PAs, for example, how is that reflected in your methodology? If I understand correctly, you took 400 PAs and compared the 200 odd PAs to the 200 even PAs and looked for correlation. But that doesn’t tell you that an individual’s first 200 PAs correlate to his next 200 PAs, yet it’s being advertised as such. Instead, it means that it takes 200 PAs consisting of every other PA over 400 PAs for measure X to stabilize. That’s not…

8 years ago