MLB Starting Rotations: Using Data to Define an Ace (and a 2 and a 3 and a 4…)

Since I really want to use the blogosphere to solve as many of baseball's infinite puzzles as I possibly can (within the constraints of life), it probably seems like I'm not being very ambitious with this post – at least if you're judging by the title. I get it…there's even a definition of "Ace" provided by Major League Baseball at MLB.com. That's about as official as it gets, so consider this a closed case, right? Well, you can probably assume from the inclusion of hundreds of words below this paragraph that my answer is no. There's really not much existing literature that delineates the parameters of an "ace" (or any other spot in the rotation) in an objective, data-driven manner. The great Jeff Sullivan was on the cusp with this Fangraphs post, but ultimately conducted an opinion poll in which readers were asked if they considered the top SP in each team's rotation an "ace". Both Jeff's methodology and his conclusions underscore additional benefits of establishing objective, context-neutral parameters:

(NOTE: This isn’t a criticism of Jeff Sullivan or his post…he’s probably my favorite baseball writer by a wide margin, and his objectives with said post were not the same as my objectives with this post):

  1. Posted prior to the 2016 season, the poll is a snapshot of its moment, and 71% of respondents considered Sonny Gray an ace. With statistically rigid definitions of what an ace is, we could measure Sonny Gray's performance at that point against those criteria instead of laughing at the mere thought of being asked "Is Sonny Gray an ace?". At this juncture I'd imagine Gray is perceived as a fringe starter who fills in when someone goes down. But is that what he really is? I don't know; we haven't established what makes a fringe starter either. With context-neutral definitions of each rotation spot, we can strip away that time-bound context and easily make comparisons across seasons or even eras.
  2. Jeff concluded there were about 20 starting pitchers in Major League Baseball that most people would agree were aces, which leaves us 10 shy of what we'd expect given the MLB definition of "ace" (the top starting pitcher on a team). While small year-to-year variances are to be expected, we should consistently find about 30 pitchers who fall within the parameters of acehood. So really, Jeff's poll found there was a perception that 20 aces were active at the time – I contend that there were actually around 30, and roughly a third of them weren't all that obvious. We want to eliminate the perception aspect with definitive criteria that undeniably establish acehood.
  3. It turns out that the perception of an ace wasn't completely performance-based (shocker!): pitchers from more talented rotations were penalized for being teammates with other good starting pitchers. Stephen Strasburg outperformed many of the pitchers who polled higher than him, yet only 57% of respondents considered him an ace – largely due to being in the same rotation as Max Scherzer (and probably injuries). While some may consider it fundamentally incorrect to label multiple pitchers from the same rotation "aces", it's going to be much harder to convince me that a league-average pitcher who leads a rotation of 4 below-average teammates is more worthy of the ace label than a better pitcher who happens to share a rotation with one. Objectively speaking, an ace is unconditionally an ace based on his own performance (not on that of his teammates). The ace parameters will rid us of the perception penalty incurred by aces who are teammates with aces, and likewise the perception benefit bestowed on non-aces who overshadow their relatively inferior rotation mates.

Before we go any further, I want to make it clear that I'm writing under the assumption that an "ace" and a "#1" are synonymous. On a recent episode of Effectively Wild, Ben, Jeff, and Meg Rowley all bantered about how we define an ace, and even briefly attempted to distinguish an ace from a #1; not that they're mutually exclusive, but it sounded more like the beginning of an LSAT logic game where 'all aces are #1s, but not all #1s are aces…' from what I gathered. I don't want to strictly adhere to the MLB.com definition, but for the sake of this post, we're going to at least continue under the assumption that aces and #1s meet the same defining criteria.

Perhaps counterintuitively, the task of defining each role within a rotation is even more important given the lightening workloads of starting pitchers and, inversely, the increasing workloads of relievers. The paradigm is shifting, but cautiously, and no team should have a perennial Cy Young candidate throw anything less than the greatest quantity of innings he can possibly throw without sacrificing performance or health.

With the advent of the Opener, what truly constitutes a "Starting Pitcher" is becoming increasingly vague. It wouldn't be much of a surprise to see some of the more traditional roles played by back-of-the-rotation starting pitchers completely disappear in the pretty near future. But it should be a little more than obvious that this evolutionary process isn't necessary for all SPs, right? Perhaps the most likely progression begins with the teams under tighter budget constraints, the teams with relatively deeper relief corps than starting corps, and the ones that are just a little more forward-thinking. We saw the Rays unveil the strategy out of necessity, soon followed by the injury-stricken Athletics. But what was spawned initially out of necessity for the early adopters should presumably expand to teams doing it out of practicality.

But in the wake of all this, one puzzle we're left to figure out is which pitchers to cut from their traditional roles – who should be sacrificed to this developing experiment?

I’m not going to try and answer that in THIS post, because we need to solve another puzzle as a prerequisite – the definition of each spot in the rotation. On one hand, it couldn’t be simpler; each spot is based on the order of talent within a given pool of starting pitchers, beginning with the most talented at the top. On the other hand, it’s a complex and generally subjective matter, albeit unnecessarily; a lot of credible baseball people might require seemingly arbitrary attributes, like a minimum fastball velocity for an ace, or more strikeouts than innings for anyone in the one or two spot. I’m not saying these ideas are necessarily incorrect either, but my goal is to wash away the ambiguity. Defining the performance expectations of each spot in the rotation can be done objectively by analyzing some key metrics and keeping the parameters simple.

First we'll define the parameters. We know MLB's definition of an "ace" is the best starting pitcher on a given team. We also concede that not every team has an ace because talent isn't equally distributed. So we'll divide the pitcher roles across teams rather than within them; this means "aces" will be the top 30 starting pitchers in MLB, not the single best starting pitcher from each of the 30 teams (which is how we'd determine acehood using MLB's definition).

As easy as it is to envision the stereotypical grumpy baseball traditionalist reciting how only a few pitchers handled the majority of innings decades ago, 5-man rotations outnumbered all other combinations for the first time in 1926 (believe it or not, the 6-man rotation was actually more common than the 3-man rotation at that point). So we can call a rotation a pool of five starting pitchers without much controversy. However, given how improbable it is to expect the same 5 pitchers to make all their scheduled starts in a given year, every team generally has a 6th pitcher who can start (either in theory or with an actual spot on the 25-man roster) whenever someone from the top 5 can't. As a role every team has been forced to utilize, and the means by which many SPs crack their first rotation, the 6th spot is by no means trivial. So, while we'll call a rotation a set of 5 SPs, we're also saying they're the top 5 from a pool of 6 pitchers. This establishes 6 tiers that, under optimal conditions, would be represented by sextiles (that's what you call 6 equally-sized groups) of talent, with the first sextile holding the top 16.7% of pitchers and each subsequent tier descending from there.

Unfortunately, since true talent can’t really be quantified, we’ll have to proxy talent with performance metrics. Here I’m going to use ERA-, FIP-, and xFIP-. This lets us compare the metrics equally across different seasons, leagues, and parks, creating a context-neutral benchmark for comparison. I assume anyone who finds themselves on this blog is familiar with these three metrics and why they’re more useful than their slightly-more-traditional-non-minus counterparts. But if not, I highly recommend checking out their entries in the Fangraphs Glossary (you’ll learn a ton in like 5 minutes).

(NOTE: If you REALLY don’t feel like leaving the page, the key here is the number 100; 100 is average. An ERA-/FIP-/xFIP- under 100 is better than average, and anything above 100 is worse than average, with the absolute difference representing the percent better or worse than average. For example, a FIP- of 75 is 25% better (less) than league average: 75 – 100 = -25%. For normalized stats that end in “-“, any measure below 100 is good, while the opposite holds true for normalized metrics ending with a “+”, such as wRC+.)
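If you'd rather see that arithmetic as code, here's a trivial helper (my own, not anything official) for reading a minus stat:

# Translate a "minus" stat (ERA-, FIP-, xFIP-) into percent better/worse than league average
def read_minus_stat(value):
    diff = value - 100
    if diff == 0:
        return "exactly league average"
    direction = "worse" if diff > 0 else "better"
    return f"{abs(diff)}% {direction} than league average"

print(read_minus_stat(75))   # 25% better than league average
print(read_minus_stat(113))  # 13% worse than league average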

Instead of using these metrics individually for our approximation of talent, I'm going to use the average. ERA often comes under fire because it's a relatively poor predictor of future performance due to the amount of luck associated with its inputs – criticism that's well warranted given that FIP, xFIP, and even K%-BB% all predict future ERA better than past ERA does. But I'm including ERA here because I don't see any reason to omit past success as a component that defines an ace, or any other tier of a rotation, lucky or unlucky. However, since we're attempting to approximate talent to define each tier, it's important we limit the weight ERA carries since much of its variance is fielding-dependent. We do this by including the other two metrics, FIP(-) and xFIP(-), both of which are, as advertised, fielding-independent and rely exclusively on the pitcher. Furthermore, while each metric is results-based, the most forward-looking of them is xFIP, which predicts future FIP and future ERA better than FIP and ERA predict their own future selves. So while xFIP might be the worst descriptor of what actually happened, it's easily the best indicator of what will eventually happen. This is important because it makes future expectations a part of the equation.

Additionally, while it won't be perfect given the differing year-to-year variance of each respective metric, the average also gives us an idea of the rough cutoff for each metric individually. So once we establish our cutoffs, we could say, "player X had an ERA- of 99 but an xFIP- of 75. So he pitched like a #3 starter, but I expect him to pitch like an ace moving forward".

So our talent proxy is simply the average of ERA-, FIP-, and xFIP-, which I’ll call MEAN-. Once we establish the cutoff for each sextile, our tiers will be defined. Using data from 2002 through 2018, I looked at every pitcher who threw at least 100 IP as a starter, calculated both their MEAN- and their respective MEAN- percentile rank, and here’s what we have:
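If you want to reproduce the tiers, the calculation is straightforward in pandas. Here's a rough sketch under my assumptions about the data (a Fangraphs-style export with one row per qualifying pitcher-season and ERA-, FIP-, and xFIP- columns; the file and column names are placeholders):

import pandas as pd

# One row per pitcher-season, 2002-2018, min 100 IP as a starter (hypothetical export)
df = pd.read_csv("sp_seasons_2002_2018.csv")

# MEAN- is just the average of the three normalized metrics
df["MEAN-"] = df[["ERA-", "FIP-", "xFIP-"]].mean(axis=1)

# Percentile rank of MEAN- (lower is better, so the best seasons sit near 0 here)
df["MEAN- pctile"] = df["MEAN-"].rank(pct=True)

# Six equally sized groups (sextiles): Tier 1 = best sixth, Tier 6 = worst sixth
df["Tier"] = pd.qcut(df["MEAN-"], q=6, labels=[1, 2, 3, 4, 5, 6])

# The MEAN- cutoff (upper bound) within each tier
print(df.groupby("Tier", observed=True)["MEAN-"].max())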

While splitting our data into sextiles explains the math behind it, at first glance it might seem odd to see Tier 4 begin with the league-average MEAN-…because league average should be a #3, shouldn't it? Actually it shouldn't. There's a reason top pitching prospects are often given labels that imply something as seemingly underwhelming as a "3rd starter" – it's because 3rd starters are (barely) above-average pitchers. Sure, they're seen as the midpoint in the rotation, but they're only the midpoint when the best 5 options make all their scheduled starts, themselves included. At some point, every team utilizes its 6th option, with few exceptions. In 2018, the Indians and Rockies used the fewest starting pitchers with 7, while the average big league team utilized 12. Starting pitchers whose innings total ranked 6th or lower on their respective teams accounted for 18.8% of starting pitcher innings – only the top-ranked starting pitcher (and presumptive ace) accounted for more, at 21%. This helps explain why the 4th Tier is where league average goes, and not the 3rd Tier.
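Those innings shares are easy to sanity-check: rank each team's starters by innings thrown as a starter and add up each rank's share of the league total. A rough sketch, with the data layout and column names being my own assumptions:

import pandas as pd

sp = pd.read_csv("sp_innings_2018.csv")  # hypothetical: Team, Pitcher, IP_as_SP

# Rank starters within each team by innings as a starter (1 = team leader)
sp["TeamRank"] = sp.groupby("Team")["IP_as_SP"].rank(ascending=False, method="first")

# Lump rank 6 and lower into one bucket, then compute each bucket's share of all SP innings
sp["Bucket"] = sp["TeamRank"].clip(upper=6).astype(int)
shares = sp.groupby("Bucket")["IP_as_SP"].sum() / sp["IP_as_SP"].sum()
print(shares.round(3))  # bucket 1 = presumptive aces, bucket 6 = everyone ranked 6th or lower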

The table above shows some average performance metrics of the starting pitchers within each tier dating back to 2002. Everything descends or ascends in the order you'd expect it to, but one interesting thing about the table is the WAR column. Tiers 2 through 6 are separated pretty evenly, with a 0.6 to 0.8 WAR differential from the adjacent tier. The exception is Tier 1 (our Ace Tier), which is a full 1.5 WAR ahead of Tier 2. We can see this more clearly in the table of average WAR by tier; the linearity holds steady for the most part in tiers 2 through 6, only to turn sharply steeper between Tiers 2 and 1. So even though we'll find roughly the same number of pitchers within each tier on an annual basis, upgrading from a Tier 3 pitcher to a Tier 2 pitcher won't yield the same improvement you'd see from upgrading a Tier 2 to a Tier 1. The roughly equal tier-by-tier difference in WAR across the bottom 5 tiers suggests we get essentially flat marginal returns from any single-tier upgrade unless we're adding a Tier 1 guy (an ace!).


That may have been tough to follow, but let me put it another way. Let’s say you’re a GM headed into the offseason with the goal of upgrading your rotation via trade. For the sake of this hypothetical, you’re only able to offer one trade package comprised of a starting pitcher from your current rotation, a prospect, and cash. In return, you’ll receive a starting pitcher that’s 1 tier better than the SP you’re trading away (the prospect and cash are irrelevant other than making the tier downgrade worthwhile for your trade partner). We’ll hold the prospect and cash fixed, so the only part of the offer you can change is the tier of the pitcher you give up, and therefore, the tier of the pitcher you receive. So here’s what you’re looking at in the trade for a new SP:

  • Assume your 5-man rotation is comprised of a starting pitcher from each of the top 5 tiers
  • You also have a Tier 6 pitcher you use as a spot starter
  • Your ace is the only pitcher you’re unable to trade
  • If you give up a Tier 6, you’ll receive a Tier 5    (~0.8 net WAR)
  • If you give up a Tier 5, you’ll receive a Tier 4    (~0.7 net WAR)
  • If you give up a Tier 4, you’ll receive a Tier 3    (~0.6 net WAR)
  • If you give up a Tier 3, you’ll receive a Tier 2    (~0.8 net WAR)
  • If you give up a Tier 2, you’ll receive a Tier 1    (~1.5 net WAR)

The right thing to do here is to give up your Tier 2 pitcher, so you end up getting a Tier 1 SP. Sure, you get two aces in the rotation now, but the reason for giving up your #2 isn't as simple as 'adding an ace'. The reason you gave up your Tier 2 for a Tier 1 is that it was the only offer whose marginal upgrade clearly beat everything else on the table. In other words, the added benefit from swapping a Tier 6 for a Tier 5 is roughly the same as the added benefit from swapping a Tier 5 for a Tier 4, a Tier 4 for a Tier 3, or a Tier 3 for a Tier 2.
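If it helps to see that logic as something other than prose, here's a toy version using the rounded net-WAR gaps from the bullets above:

# Net WAR gained from each one-tier upgrade, per the rounded gaps listed above
net_war_by_swap = {
    "Tier 6 -> Tier 5": 0.8,
    "Tier 5 -> Tier 4": 0.7,
    "Tier 4 -> Tier 3": 0.6,
    "Tier 3 -> Tier 2": 0.8,
    "Tier 2 -> Tier 1": 1.5,
}

best = max(net_war_by_swap, key=net_war_by_swap.get)
print(best, net_war_by_swap[best])  # Tier 2 -> Tier 1, 1.5 WAR: trade the #2, get the ace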

Since I have a habit of overexplaining things, I'll end with some examples of each tier using numbers from the 2018 season. For the table of 2018 Tier Examples, I randomly selected 5 pitchers within each tier just so readers get a better idea of who falls in line with a given tier.

#pitching

How to Identify Bounceback Candidates (Pitcher Edition)

Okay, a lot of people think ERA sucks. Sure, I don't really disagree in the sense that it's luck-laden and a poor predictor of future performance. It's a shallow measure, but it still seems to get the best of decision-makers even at the highest levels; Jon Gray was left off the Rockies' playoff roster after posting a 5.12 ERA that wasn't really compatible with his 9.6 K/9 and 2.72 BB/9. Domingo German couldn't stay in the Majors with his 5.57 ERA in spite of striking out nearly 11 per 9 and walking 3.5/9.

This isn't a defense of ERA by any means – it's not. This is a guide to figuring out whose 2019 ERA is (probably) going to be better than their 2018 ERA, and it's pretty simple. Fangraphs features a metric called "E-F", which is simply a pitcher's ERA minus FIP. This can give us some idea of how representative the pitcher's ERA actually is – grossly oversimplified, it gives us a measure of luck. These facts have been fairly well-documented, but just as a refresher, I want to reiterate the following:

  • ERA is a relatively poor predictor of future ERA
  • FIP is a better predictor of future ERA but still not great
  • xFIP is a better predictor of future ERA and future FIP than both ERA and FIP

Results-based analysis is tricky business, but not totally unreliable when done correctly. ERA is far from the ideal indicator of a pitcher's ability, a shortcoming FIP addresses, though FIP still carries a lot of noise that gets washed away in xFIP. Things that show little or no year-to-year correlation, such as HR/FB% or BABIP, are controlled for by applying constants in the calculation of xFIP, which is why it's probably the best metric we have to evaluate how good a pitcher's been, at least on the same scale as ERA. Unfortunately, fans, fantasy leagues, and the general consumption of baseball continue to emphasize ERA in spite of its obvious shortcomings, probably due to a fear of adaptation. So even though it would be more practical and easier to predict future xFIP, we're going to predict future ERA with xFIP, since it's still the best we've got.

Let's check out the correlation matrix of ERA predictors I put together. This uses all big-league pitchers from 2010-2017 with at least 30 IP in a given half-season who also threw at least 30 IP in the subsequent half-season. I did notice that the within-period correlations aren't identical in both time periods (ERA's correlations with FIP and xFIP are .67 and .49 in t=0, but .70 and .55 in t+1…this still occurs even when ERA-/FIP-/xFIP- are used instead, so I'm theorizing that it's just a matter of a pitcher gaining consistency with an additional year of experience, but that's another post for another day.) We can see that each of the bullet points above is reflected in the matrix, and that xFIP does a much better job of predicting the future than any other metric. So what am I trying to prove here? That xFIP is a super useful metric that isn't used enough for predictive analysis! And unlike ERA, xFIP is a superb predictor of itself, which is why I highlighted that particular part of the matrix, and added the chart on xFIP predictability. Worth noting is that the full-season correlation between ERA and xFIP is a much better-looking 0.64, compared to the half-season correlations shown in the matrix, so being able to predict xFIP from one period to the next is pretty valuable.
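If you want to rebuild that matrix yourself, the recipe is to pair each qualifying half-season with the same pitcher's next half-season and run a plain correlation. Here's a sketch under my own assumptions about the data layout (one row per pitcher half-season with at least 30 IP; the file and column names are placeholders):

import pandas as pd

d = pd.read_csv("pitcher_half_seasons_2010_2017.csv")  # hypothetical: PitcherId, Half, ERA, FIP, xFIP
d = d.sort_values(["PitcherId", "Half"])

# Pair each half-season (t=0) with the same pitcher's next half-season (t+1);
# assumes the Half index is consecutive for each pitcher kept in the sample
nxt = d.groupby("PitcherId")[["ERA", "FIP", "xFIP"]].shift(-1)
nxt.columns = ["ERA_t1", "FIP_t1", "xFIP_t1"]
paired = pd.concat([d, nxt], axis=1).dropna(subset=["ERA_t1"])

# Correlation matrix of t=0 metrics against their t+1 counterparts
print(paired[["ERA", "FIP", "xFIP", "ERA_t1", "FIP_t1", "xFIP_t1"]].corr().round(2))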


So now that I’ve emphasized the value of xFIP versus the other metrics as predictors with some visual overkill, I’m going to rework the Fangraphs’ metric I mentioned earlier: instead of E-F (ERA-FIP), we’ll be using E-X (ERA-xFIP).

Let’s set up some definitions that will apply to the remainder of this post:

  1. Overachiever – A pitcher whose xFIP exceeds his ERA. In this case the E-X is negative.

    2018 Example: Wade Miley; 2.57 ERA/ 4.3 xFIP/ -1.73 E-X with MIL

  2. Underachiever – A pitcher whose xFIP is less than his ERA. In this case the E-X is positive.

    2018 Example: Marcus Stroman; 5.54 ERA/ 3.84 xFIP/ 1.7 E-X with TOR
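In code, the split is about as simple as it gets; here's a quick sketch using the two 2018 examples above:

# E-X = ERA minus xFIP; negative means overachiever, positive means underachiever
def classify_e_x(era, xfip):
    e_x = round(era - xfip, 2)
    label = "Overachiever" if e_x < 0 else "Underachiever"
    return e_x, label

print("Miley:", classify_e_x(2.57, 4.30))    # (-1.73, 'Overachiever')
print("Stroman:", classify_e_x(5.54, 3.84))  # (1.7, 'Underachiever')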

The intuition here is simple enough – overachievers are due for positive regression (remember that “positive” is bad when it comes to ERA/FIP/xFIP) and underachievers are due for negative regression. In other words, pitchers with a negative E-X should see their ERAs increase, while pitchers with a positive E-X should see their ERAs decrease. I said “should”, but I really mean “do”, because the effect is quite robust when we use aggregated data. The first chart looking at ERA changes from 2017 to 2018 suggests that, while E-X is a good indicator of the direction a pitcher’s ERA is headed, underachievers appear to be more predictable than overachievers – at least using non-normalized metrics.

[Chart: 2017-to-2018 ERA changes by 2017 E-X – standard ERA & xFIP]

Now since ERA is known to fluctuate over time and we need normalized metrics to compare across eras, I wanted to see how predictability changes (if it does at all) when we use ERA- and xFIP- instead of standard ERA and xFIP. Here, the effect is consistent across both groups (both overachievers and underachievers). Take a look at the chart below:

[Chart: 2017-to-2018 ERA changes by 2017 E-X – normalized ERA & xFIP (ERA- & xFIP-)]

This tells us that roughly 73% of overachieving pitchers in 2017 saw a rise in their 2018 ERA, while an almost identical portion of 2017 underachievers (72%) saw a decline in their 2018 ERA. That means, with respect to this sample, nearly three-quarters of the time we accurately predicted the direction of future ERA by subtracting xFIP- from ERA-. This is pretty powerful, but it's limited in the sense that we're looking at a binary prediction – it's yes or no; while we can reasonably expect the ERA to increase or decrease, we don't know by how much. And we all know to be skeptical when sample sizes are small; just 169 pitchers threw at least 40 IP in both 2017 and 2018, so let's see what happens when we have a sample 8.5 times larger than what's reflected in the 2017/2018 chart…

[Chart: half-season-to-half-season ERA changes by E-X – normalized ERA & xFIP (ERA- & xFIP-)]

And there you go; 71% of overachievers saw their ERA go up in the subsequent half-season, and 72% of underachievers saw their ERA go down – basically unchanged from the previous chart. Here, time is grouped into half seasons rather than full seasons, which gives us an even greater sample to look at. So E-X is legit when it comes to predicting improvement or decline, but why not build on that if we can? If we’re trying to identify bounceback candidates, wouldn’t it be nice if we could know exactly how likely it is that a pitcher’s ERA will be lower next season (or next half-season) than it was in the most recent one?
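Reproducing those percentages is just a matter of asking, for each group, whether ERA- moved in the direction E-X implied in the following period. A quick sketch, reusing the same kind of paired half-season data as the correlation matrix (the normalized column names are my own):

import pandas as pd

# Hypothetical paired half-seasons: ERA- and xFIP- in period t, ERA- in period t+1
p = pd.read_csv("paired_half_seasons_minus.csv")

p["E_X"] = p["ERA_minus_t0"] - p["xFIP_minus_t0"]
over = p[p["E_X"] < 0]   # overachievers: expect ERA- to rise in t+1
under = p[p["E_X"] > 0]  # underachievers: expect ERA- to fall in t+1

print("Overachievers who got worse:", round((over["ERA_minus_t1"] > over["ERA_minus_t0"]).mean(), 2))
print("Underachievers who improved:", round((under["ERA_minus_t1"] < under["ERA_minus_t0"]).mean(), 2))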

Obviously the answer to that question is 'yes', so I modeled the probability of ERA improvement using E-X as the lone explanatory variable and ran a logistic regression on the binary outcome of whether or not ERA improved in half-season t+1. The summary statistics are shown below, as well as how to calculate the probability.

N=1454
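For the curious, here's a minimal sketch of how that fit could be run with statsmodels; the actual fit used those 1,454 half-season pairs, while the file and column names below are just placeholders:

import pandas as pd
import statsmodels.api as sm

# Hypothetical paired half-seasons: ERA- and xFIP- in period t, ERA- in period t+1
p = pd.read_csv("paired_half_seasons_minus.csv")

p["E_X"] = p["ERA_minus_t0"] - p["xFIP_minus_t0"]
p["improved"] = (p["ERA_minus_t1"] < p["ERA_minus_t0"]).astype(int)  # 1 = ERA improved in t+1

X = sm.add_constant(p[["E_X"]])
fit = sm.Logit(p["improved"], X).fit()
print(fit.summary())  # intercept and E_X coefficient, roughly -0.06 and 0.059 per the examples below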

Calculating the probability estimate of this model isn’t like a typical linear regression, so if you wanted to apply it to a particular pitcher on your own, here’s how it works:
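Plug the pitcher's most recent E-X into the standard logistic function, using the model's intercept and slope (roughly -0.06 and 0.059, the same rounded values that show up in the worked examples below):

Probability of ERA improvement = 1 / (1 + e^-(intercept + coefficient*E-X)) = 1 / (1 + e^-(-0.06 + 0.059*E-X))

Or, if you'd rather let Python do the arithmetic, here's a one-function sketch with those rounded coefficients baked in as defaults:

from math import exp

def prob_era_improves(e_x, b0=-0.06, b1=0.059):
    # Logistic curve; b0 and b1 are the rounded intercept and E-X coefficient
    return 1 / (1 + exp(-(b0 + b1 * e_x)))

print(round(prob_era_improves(16), 2))  # ~0.71 with these rounded coefficients (vs. the 71.8% quoted below from the unrounded fit)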

So rather than going through too much more math, let's move on to what the model tells us by using the probability of ERA Improvement chart:

This shows us the estimated probability of a given pitcher improving his ERA in the next time period (in this case, half of a season), based on the E-X in the most recent period. While the model is built off half-season samples, we can reasonably apply it to different time groups that occur consecutively, like a full season (we don't want to stray too far from the half-season though, because we'd fail to account for a lot of player-specific changes that might occur between the two time periods. For example, we wouldn't want t=0 to be the last 5 years, where we're trying to predict improvement in the next 5 years, because a lot of changes could occur with the pitcher we're looking at; his pitch mix might change, his velocity almost certainly will, perhaps he has Tommy John surgery, etc.) So, at an E-X of 0, we see the probability of improving ERA is 50%, which is right where we'd expect it to be (actually it's 49.8% if we take it out to the thousandths place…the absolute probability difference between an E-X of 0 and -10 is almost the same as the difference between 0 and +10, but I kept the probability estimates to two decimal places for the sake of simplicity). The greater the E-X in the most recent (half) season, the more likely it is the pitcher's ERA will drop in the next (half) season; even though only 18% of pitchers post E-Xs of at least 20, it's certainly worth noting their probability of improvement is better than three-quarters. Even rarer is an E-X of 40 or greater, which occurs just 4% of the time, but it's practically a guarantee of improvement at 91%.

So just for fun, let’s apply the model to a pitcher using his 2018 E-X, and determine the probability that his ERA will improve. One guy a lot of people might be curious about is Sonny Gray; are greener pastures ahead for Sonny in 2019? Or was all that chaos in New York City the catalyst to an irreversible downward trend? Well…let’s find out!

2018 Sonny Gray – NYY

ERA: 4.90    xFIP: 4.10

ERA-: 113    xFIP-: 97

E-X = 113-97 = 16    Now we’ll apply the model…

1/(1+e^-[-0.06+{0.059*16}]) = 0.718

Estimated probability of improvement is 71.8%! So Sonny Gray’s got a pretty good shot at being a better pitcher in 2019 than he was in 2018.

Let's do another…how about NL Cy Young Award winner Jacob DeGrom? DeGrom had an absolutely insane year that a bunch of morons tried discrediting at various stages, but most of the people reading this are probably aware of how special it actually was. So how likely is it that DeGrom will be even better next year?

2018 Jacob DeGrom – NYM

ERA: 1.70    xFIP: 2.60

ERA-: 45    xFIP-: 64

    E-X = 45-64 = -19

1/(1+e^-[-0.06+{0.059*-19}]) = 0.245

So the model gives DeGrom a 24.5% shot at improving his ERA in 2019, which isn’t that bad considering there’s not much room for improvement when your ERA is 1.7…the closer you get to 0, the more improbable improvement becomes!

Instead of continuing with random case-by-case examples, I added a few names to the probability chart to go along with Sonny Gray and Jacob DeGrom. I also built a table of 25 semi-randomly selected pitchers alongside their 2018 numbers and their respective 2019 ERA improvement probabilities. One thing that's fairly clear, though also quite intuitive, is that it's difficult to improve upon good performances; DeGrom, Max Scherzer, and Justin Verlander are unlikely to be better in 2019 than they were in 2018, largely because they were just so good. Applying that same intuition to the other end of the spectrum, it's pretty easy to improve on bad performances – Clayton Richard is almost certainly going to be better in 2019 because he set the bar so low. Those are the predictable cases – the ones in which the probability model does nothing but reaffirm what we basically already knew. Among those shown in the table, the more interesting cases are those of Josh Hader and Carlos Carrasco, both of whom enjoyed incredible 2018 seasons and are actually more likely than not to improve in 2019. There are also a few names not shown in the table who are in the same boat as Hader and Carrasco, such as Patrick Corbin, Dellin Betances, Ross Stripling, and Edwin Diaz – all of them are likely to improve in 2019 after being phenomenal in 2018.

#aaron-sanchez, #alex-cobb, #anibal-sanchez, #carlos-carrasco, #chris-sale, #clayton-richard, #dallas-keuchel, #dellin-betances, #domingo-german, #e-f, #e-x, #edwin-diaz, #edwin-jackson, #era, #jacob-degrom, #jake-odorizzi, #jakob-junis, #joe-musgrove, #jon-gray, #jose-quintana, #jose-urena, #josh-hader, #justin-verlander, #kenta-maeda, #kyle-freeland, #madison-bumgarner, #marcus-stroman, #matt-harvey, #max-scherzer, #michael-fulmer, #mike-leake, #patrick-corbin, #pitching, #pitching-projections, #rich-hill, #robbie-erlin, #ross-stripling, #sonny-gray, #tyler-anderson, #tyler-mahle, #wade-miley, #xfip