Tuesday, September 1, 2009

Do-It-Yourself: A Quantitative Examination of Predictors of Major League Success

For well over a decade, we have been working on better ways to identify prospects. We began this quest before electronic data on the Minor Leagues was available, when we would meticulously copy and input data from the Sporting News. As time has passed, electronic data has become more readily available. Through the dedicated work of the people at SABR's Minor League committee and of Gary Cohen at the Baseball Cube, we have been able to add to our historical database of Minor League statistics, to the point where we now have over 40 years of Minor League data.
As each year passes and more data becomes available, we are able to take deeper, more precise looks at some of the concepts that we have spent more than a decade on. So last month we decided to redo our study on predictive stats of Minor League players. The objective of the analysis was to determine which Minor League statistics are the best indicators of future Major League success. We pulled together approximately 15 different statistics for pitchers and about 20 different hitting statistics (some with multiple variations). We then compared these to the cumulative Major League performance of our control group. Additionally, we performed a number of multiple-variable regression analyses to try to determine statistical combinations that improve our results. We detail both the methods and the results below.

The data universe we used comprised Hi-A and AA players from the 1997 and 1998 Minor League seasons. These years were chosen because we have ‘calculated’ Minor League Park Factors available from 1996 to the present, and only estimated park factors for prior seasons. We used all leagues, and included all hitters with 120 or more ABs and all pitchers with at least 50 IP. Additionally, we excluded any player who already had Major League experience. All data was normalized for park effects. Each of the 12 leagues (2 years x 2 levels x 3 leagues) was analyzed only within itself, i.e., raw statistics from Eastern League players were not compared with raw statistics from Texas League players. For each statistical variable, we calculated the number of standard deviations that value represented, above or below the league mean. We have been utilizing this practice since reading some work performed by Tony Blengino in 1977 (I believe he is currently special assistant to the GM in Seattle). One of the truisms of prospect evaluation is that Major League players come from Minor League players whose performance was positively outside the expected norms. Once we had a data universe (approximately 1500 hitters and 1100 pitchers), we set about determining a numerical value for their Major League performance. After careful consideration, we settled on using to-date career Batting Runs Created for hitters, and we approximated Pitching Runs Created for pitchers.
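
The league-relative standardization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the study: the field names (player, league, ops), the sample rows, and the choice of the population standard deviation are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def league_z_scores(rows, stat):
    """Return {player: z}, where z is the number of standard deviations
    the player's value sits above or below his own league's mean."""
    by_league = defaultdict(list)
    for row in rows:
        by_league[row["league"]].append(row)

    z = {}
    for league_rows in by_league.values():
        values = [r[stat] for r in league_rows]
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population standard deviation
        for r in league_rows:
            # A league with no spread carries no information; score it as 0.
            z[r["player"]] = (r[stat] - mu) / sigma if sigma else 0.0
    return z


# Hypothetical rows: the same raw OPS means different things in different leagues.
rows = [
    {"player": "A", "league": "Eastern 1997", "ops": 0.700},
    {"player": "B", "league": "Eastern 1997", "ops": 0.800},
    {"player": "C", "league": "Eastern 1997", "ops": 0.900},
    {"player": "D", "league": "Texas 1997", "ops": 0.800},
    {"player": "E", "league": "Texas 1997", "ops": 0.900},
    {"player": "F", "league": "Texas 1997", "ops": 1.000},
]
z = league_z_scores(rows, "ops")
```

In this made-up sample, players C and E share a raw OPS of .900, yet C is well above his league's mean while E sits right at his, which is exactly why raw numbers are never compared across leagues.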

For reference purposes, only about 30% of Minor League hitters and 35% of Minor League pitchers at these levels produced any Major League playing time at all, and only about 9% of hitters and 8% of pitchers experienced anything that could be considered above replacement level status.

Once we had assigned values to each player's Major League career, we then set about determining the correlation coefficients for each of the statistics that we used. The following were the calculated correlation coefficients producing the highest results:

Pitcher Statistical Correlation Coefficients
Age – (0.26)
IP/G – 0.21
K/BFP – 0.19
K/IP – 0.18
WHIP – (0.14)
CERA – (0.14)
ERA – (0.13)
K:BB – 0.13
HR/IP – (0.08)
BB/IP – (0.05)

Hitter Statistical Correlation Coefficients
Age – (0.29)
RC/27 – 0.29
OPS – 0.28
2*OBA + SLG – 0.27
wOBA – 0.27
SLG – 0.26
AVG – 0.25
OBP – 0.23
H/PA – 0.23
H/BIP – 0.17
wPOW – 0.21
LWP – 0.21
ISO – 0.21
XBH/AB – 0.21
HR/AB – 0.19
K/PA – (0.15)
2B/AB – 0.14
First Base Rate(FBR) – 0.12
Speed (Diamond Futures) – 0.10
Speed (James) – 0.10
1B/AB – 0.09
BB/PA – 0.08
3B/AB – 0.05
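
A coefficient like those listed above can be computed as follows. The study does not say which correlation measure was used, so this minimal sketch assumes the ordinary Pearson product-moment coefficient; the function name and the sample numbers are made up for illustration. The values shown in parentheses in the lists appear to be negative correlations, as with Age, where younger players at a level tend to go on to better careers.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: league-relative age (in standard deviations) against
# career runs created. Older-for-level players produce less here, so r < 0.
age_z = [-1.5, -0.5, 0.0, 0.5, 1.5]
career_runs = [40.0, 10.0, 5.0, 0.0, 0.0]
r = pearson_r(age_z, career_runs)
print(round(r, 2))
```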

A couple of interesting observations:
The results tend to support the plethora of research that’s been compiled over the last few years regarding the randomness of Balls in Play (BIP).
In all of the hitting outcomes (single, double, triple, etc.), using ABs as the denominator produces the strongest correlations.

Further Explanations of Statistics used
CERA is the standard calculation for Component ERA.
wOBA is the Tom Tango weighted OBA formula.
wPOW is a weighted Power calculation that we use at Diamond Futures. The formula is: wPOW=(doubles*3.0 + triples*1.2 + homeruns*10.0)/AB. It is scaled to approximate SLG, but focuses on only the XB components.
First Base Rate, defined as FBR=((1B/AB + BB/PA)*1.2), is an on-base calculation that we use at Diamond Futures that is scaled to approximate OBP, but is only focused on singles and walks.
Speed is a calculation we use at Diamond Futures that weights three of the components (SpdSB%, SpdSBA, SpdR) from the Bill James Speed Calculation. It is scaled to produce a speed rating that ranges from 0-10.
Speed (James) is the standard Bill James 4-component calculation (not using fielding range).
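
The two Diamond Futures formulas given above, wPOW and First Base Rate, translate directly into code. The formulas are taken as written; the function and parameter names and the sample stat line are hypothetical, and the speed calculations are left out because their component weights are not given here.

```python
def wpow(doubles, triples, home_runs, at_bats):
    """Weighted power: (2B*3.0 + 3B*1.2 + HR*10.0) / AB."""
    return (doubles * 3.0 + triples * 1.2 + home_runs * 10.0) / at_bats


def first_base_rate(singles, walks, at_bats, plate_appearances):
    """First Base Rate: (1B/AB + BB/PA) * 1.2."""
    return (singles / at_bats + walks / plate_appearances) * 1.2


# Hypothetical season line: 500 AB, 560 PA, 100 1B, 30 2B, 5 3B, 20 HR, 50 BB.
power = wpow(doubles=30, triples=5, home_runs=20, at_bats=500)
on_base = first_base_rate(singles=100, walks=50, at_bats=500, plate_appearances=560)
print(round(power, 3), round(on_base, 3))
```

Both helpers assume the player has at least one AB and one PA; the 120 AB minimum used in the study guarantees that for every hitter in the sample.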

Conclusions –

Age vs. Level of competition is still the single strongest indicator of future Major League success.

Triples have surprisingly little correlation, and little value as a predictive factor even when used in combination.

While everyone continually searches for a single statistical measure, the most useful predictive statistics remain weighted combinations of simple statistics. For example, Average, which is (1B+2B+3B+HR)/AB, has a correlation coefficient of 0.25, but we can combine 1B/AB, 2B/AB, 3B/AB and HR/AB in a weighted method that yields a correlation of 0.28. Our best correlations occur when we weight and combine multiple, unrelated measures. By this I mean it is best to avoid using both OBP and SLG, because we would be counting too many of the same statistics twice (singles, doubles, home runs, etc.) to produce precise results. Instead, we want to use measures that isolate individual characteristics or variables.

While people like us continue to try to reinvent the wheel, traditional scouting has long told us about the five tools of a hitter: Hit for Average, Hit for Power, Speed, Defense and Arm Strength. The numbers tend to back up the traditional intuition, with slight modification. The results indicate that there are four significant offensive characteristics that predict Major League performance: 1) the ability to make contact/reach base {best defined by the formula for First Base Rate}; 2) the ability to hit for power {defined by the formula wPOW}; 3) the ability to judge/control the strike zone {defined by K/AB}; and 4) speed {defined by using weighted components of the Bill James Speed formula}. When these are used in combination with 5) Age vs. Level of competition, we can run a regression analysis that yields a weighted formula producing a correlation coefficient of 0.45. It is likely that at some future date we will look at incorporating some sort of fielding/zone rating, and we would then likely get correlations greater than 0.50.
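
The weighting step described above can be sketched as an ordinary least-squares fit followed by a correlation between the fitted values and the actual career values. The study does not specify the regression method, so ordinary least squares is an assumption here, as are all function names. The sample data is synthetic and built so that the target is an exact linear combination of two features, which lets the arithmetic be checked; real prospect data would be far noisier.

```python
from math import sqrt


def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_weights(features, target):
    """Least-squares weights: the intercept, then one weight per feature."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, feature_row):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Synthetic check: target = 2 + 3*x1 - 1*x2, so the fit should recover it.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
target = [2.0 + 3.0 * x1 - 1.0 * x2 for x1, x2 in features]
weights = fit_weights(features, target)
fitted = [predict(weights, f) for f in features]
r = pearson_r(fitted, target)
```

With real inputs, each feature tuple would hold a player's league-relative scores (for example First Base Rate, wPOW, strikeout rate, speed and age), and the correlation between fitted and actual values is the figure to compare against the single-statistic coefficients.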

The actual results produced are somewhat striking. Of the top 50 hitter names produced from the regression analysis formula, 38 went on to have significant Major League careers. Of the top 25 names, only Ruben Mateo, Cal Pickering and Dernell Stenson could be classified as 'misses'. After deriving the formulas from the 1997 and 1998 data, we then tested them on data from the 1996 and 1999 seasons and produced equally strong correlation results.

Pitching isn’t as easily evaluated, or at least doesn’t yield results as strong, but that doesn't mean there aren't predictive measures. Again staying away from traditional cumulative statistical measures like ERA and WHIP, we can break pitching down into five significant characteristics: 1) Age; 2) Stamina {measured by IP/G}, since Major League pitchers strongly tend to come from the pool of Minor League starting pitchers; 3) Dominance {which we define using K/BFP}; 4) Ability to keep the ball down and not give up the HR {defined by HR/IP}; and 5) Ability to avoid the free pass {defined by BB/IP}. I would have liked to see the correlations of Ground Ball/Fly Ball ratios, as some recent work we have done shows some promise in this area, but we just don’t have the historical data available to test them. Once we run a regression analysis on these variables, we can weight them in a manner that produces a correlation coefficient of 0.35.

All of these results are built on characteristics defined by one small data segment (minimum of 120 ABs or 50 IP). We have just started to look at how having multiple years/multiple data segments can be combined to produce even better results.

If you would like a slightly more detailed version of this study as a Word document, feel free to email me at baseballnumbers@ix.netcom.com.
