#### Background

In case you aren't familiar, 'regression toward the mean' roughly means that if one observation of a random variable is an outlier, a future observation is likely to be closer to the mean. As a really simple model for something like NFL player performance, imagine that each player's performance is X% skill and Y% luck, with X + Y = 100. If X is 100 and Y is 0, then previous years will nearly perfectly predict future years. If Y is 100 and X is 0, then there will be no relationship between performance from one year to the next. If X and Y are both between 0 and 100, there will be some relationship between performance from year to year, but it won't be perfect.
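Here is a minimal sketch of that skill-plus-luck model, in case a simulation helps. It rests on assumptions that are not from the data: skill and luck are independent standard normal draws, skill stays fixed for a player while luck is redrawn each season, and "X% skill" is read as the share of performance variance that comes from skill. The function names `simulate_two_seasons` and `ols_slope` are made up for this illustration.

```python
import random


def simulate_two_seasons(skill_share, n_players=5000, seed=0):
    """Simulate two seasons for a toy model (an assumption, not real NFL data).

    skill_share is the fraction of performance variance due to skill (0 to 1).
    """
    rng = random.Random(seed)
    skill_weight = skill_share ** 0.5
    luck_weight = (1 - skill_share) ** 0.5
    year1, year2 = [], []
    for _ in range(n_players):
        skill = rng.gauss(0, 1)  # fixed for the player across both seasons
        # Luck is drawn independently for each season.
        year1.append(skill_weight * skill + luck_weight * rng.gauss(0, 1))
        year2.append(skill_weight * skill + luck_weight * rng.gauss(0, 1))
    return year1, year2


def ols_slope(x, y):
    """Slope of the least-squares line predicting y from x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


for share in (1.0, 0.5, 0.0):
    y1, y2 = simulate_two_seasons(share)
    print(f"skill share {share:.0%}: year-2 vs year-1 slope = {ols_slope(y1, y2):.2f}")
```

Under these assumptions the fitted slope should land close to the skill share itself: about 1 when it's all skill, about 0 when it's all luck, and somewhere in between otherwise.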

There are two easy ways for me to look at this phenomenon:

- plot one year's performance against the previous year's along with a line with a slope of 1 (X = 100%) and a best-fitting line
- bin the data by previous year's performance and look at how each bin shifted in the next year
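The second approach (binning) can be sketched roughly as follows. This is not the code behind the plots, just one plausible way to do it: sort players by their year-1 value, split them into three equal-sized groups, and average the year-over-year change in each group. The helper name `mean_change_by_third` and the numbers in the example are made up for illustration.

```python
def mean_change_by_third(year1, year2):
    """Split players into thirds by year-1 value; return mean (year2 - year1) per third."""
    if len(year1) != len(year2) or len(year1) < 3:
        raise ValueError("need two equal-length lists with at least 3 players")
    pairs = sorted(zip(year1, year2))  # ordered by year-1 performance
    n = len(pairs)
    cuts = [0, n // 3, 2 * n // 3, n]
    labels = ["bottom third", "middle third", "top third"]
    result = {}
    for label, lo, hi in zip(labels, cuts, cuts[1:]):
        group = pairs[lo:hi]
        result[label] = sum(b - a for a, b in group) / len(group)
    return result


# Toy numbers (not real player stats): low scorers improve, high scorers decline.
toy_year1 = [10, 20, 30, 40, 50, 60]
toy_year2 = [25, 30, 33, 37, 40, 45]
print(mean_change_by_third(toy_year1, toy_year2))
# {'bottom third': 12.5, 'middle third': 0.0, 'top third': -12.5}
```

A positive number for the bottom third and a negative number for the top third is the signature of regression toward the mean.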

What might we see? There are many possibilities, but here are a few examples:

- "Players that performed well perform even better the next season": plot 1 will show a slope greater than 1 and plot 2 will show the bottom bin doing worse and the top bin doing better
- "Performance is driven by skill so it's the same year-to-year": plot 1 will show a slope of 1 and plot 2 will show all bins at roughly zero
- "Performance is a mix of skill and luck so top performers will move back towards average and poor performers will move up towards average (**this is the regression toward the mean case**)": plot 1 will show a slope between 0 and 1, and plot 2 will show the bottom bin doing better and the top bin doing worse
- "It's all random/luck": plot 1 will show a slope of ~0 and plot 2 will show every bin ending up at roughly the overall average, so the bottom bin rises and the top bin falls by the full amount
- "Poor performers overcompensate and end up better than average next season": plot 1 will show a slope less than 0 and plot 2 will show the bottom bin doing better and the top bin doing worse, by more than in the regression toward the mean case

To test this, I ran the analysis on 5 different stats using data for all starters from 2000-2020. Each comparison pairs consecutive seasons: for the 2010-2011 comparison, year 1 is 2010 and year 2 is 2011. If there is regression toward the mean, you would expect the best performers in 2010 to do a bit worse in 2011, and the worst in 2010 to do a bit better in 2011. In the bar plots, the 'bottom third' means the 33% of players that were worst in year 1 from the plot above.
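For anyone wanting to try this on their own data, the year 1 / year 2 pairing could look something like the sketch below. The input layout (one `(player, season, value)` tuple per player-season) and the function name `pair_consecutive_seasons` are assumptions made for this example, and the rows are made up.

```python
def pair_consecutive_seasons(rows):
    """Pair each player's season with the same player's following season.

    rows: iterable of (player, season, value) tuples (hypothetical layout).
    Returns a list of (player, year1_season, year1_value, year2_value).
    """
    by_key = {(player, season): value for player, season, value in rows}
    pairs = []
    for (player, season), value in sorted(by_key.items()):
        next_value = by_key.get((player, season + 1))
        if next_value is not None:  # skip players with no data the next season
            pairs.append((player, season, value, next_value))
    return pairs


# Made-up rows: player "A" has a gap after 2011, so only 2010-2011 is paired.
toy_rows = [
    ("A", 2010, 80.0), ("A", 2011, 90.0), ("A", 2013, 70.0),
    ("B", 2010, 60.0), ("B", 2011, 65.0), ("B", 2012, 62.0),
]
for pair in pair_consecutive_seasons(toy_rows):
    print(pair)
```

Each resulting pair contributes one point to the scatter plot (year 1 on one axis, year 2 on the other) and one player to the year-1 bins.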

#### Results

The data show regression toward the mean. Every stat I've tried (each with a luck component, obviously) followed the pattern above.
