Preweighting and Playoffs

Mar 20, 2017

As per my last Corsi improvement, the adjusted Corsi formula has been getting rather unwieldy. It's not likely to get much more predictive without some major revisions, and even then it's a bit of a pain to work with. There is, however, an alternate form that's slightly less predictive but a little more malleable. To show this I'm going to take a step back to the Corsi SA formula, which is the following:

$$C_{SA} = {\sum \limits_{n=down3}^{up3} {({F_{n} + A_{n}})({{F_{n}} \over {F_{n} + A_{n}}} - {{F_{avg_n}} \over {F_{avg_n} + A_{avg_n}}})} \over \sum \limits_{n=down3}^{up3} ({F_{n} + A_{n}})} + 50\% $$

On a basic level, what this formula is doing is calculating the difference between a team's performance and average league performance in each score state, and then weighting that difference by the number of the team's events in that state. An alternative approach is to just weight each event by its rarity in that state:

$$C_{SA} = {\sum \limits_{n=down3}^{up3} {F_{n} {{F_{avg_n} + A_{avg_n}} \over {F_{avg_n}}}} \over \sum \limits_{n=down3}^{up3} \left( F_{n} {{F_{avg_n} + A_{avg_n}} \over {F_{avg_n}}} + A_{n} {{F_{avg_n} + A_{avg_n}} \over {A_{avg_n}}} \right)}$$

These formulas aren't exactly equal, but they're pretty close. The main advantage of the latter formula is that the weight terms, $$W_{F_n} = {{F_{avg_n} + A_{avg_n}} \over {F_{avg_n}}}, W_{A_n} = {{F_{avg_n} + A_{avg_n}} \over {A_{avg_n}}}$$ can be calculated from past seasons, which essentially makes them constants. Once we do that, the formula simplifies to:

$$C_{SA} = {\sum \limits_{n=down3}^{up3} {F_{n} W_{F_n}} \over \sum \limits_{n=down3}^{up3} \left( F_{n} W_{F_n} + A_{n} W_{A_n} \right)}$$
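To make the comparison concrete, here is a minimal Python sketch of both forms: the original one that compares each score state against the league average, and the simplified one that uses constant weights. The function names and all of the event counts are made up for illustration; this is not the site's actual code or data.

```python
# Score states, from down 3 (or more) through up 3 (or more).
STATES = (-3, -2, -1, 0, 1, 2, 3)

def calculated_corsi_sa(team, league):
    """Original form: per-state gap to league average, weighted by team events."""
    num = 0.0
    den = 0.0
    for s in STATES:
        f, a = team[s]
        f_avg, a_avg = league[s]
        events = f + a
        if events == 0:
            continue  # no events in this state, so nothing to weight
        num += events * (f / events - f_avg / (f_avg + a_avg))
        den += events
    return num / den + 0.5

def state_weights(league):
    """W_F and W_A for each score state, computed once from league totals."""
    return {s: ((f + a) / f, (f + a) / a) for s, (f, a) in league.items()}

def preweighted_corsi_sa(team, weights):
    """Simplified form: every event is scaled by its state's constant weight."""
    wf = sum(team[s][0] * weights[s][0] for s in STATES)
    wa = sum(team[s][1] * weights[s][1] for s in STATES)
    return wf / (wf + wa)

# Invented (for, against) counts per score state. League-wide, events "for"
# when up n mirror events "against" when down n.
league = {-3: (1120, 880), -2: (2180, 1820), -1: (5300, 4700), 0: (9000, 9000),
          1: (4700, 5300), 2: (1820, 2180), 3: (880, 1120)}
team = {-3: (40, 30), -2: (90, 70), -1: (210, 190), 0: (400, 380),
        1: (260, 280), 2: (120, 140), 3: (50, 60)}

weights = state_weights(league)
print(calculated_corsi_sa(team, league), preweighted_corsi_sa(team, weights))
```

In this toy example the two results land very close to each other, and a team whose counts exactly match the league totals comes out at 50% under either form.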

This form is also applicable to venue and event adjustment by splitting the data across those vectors and similarly calculating a weight for each. Score adjustment creates 7 weights, venue adjustment creates 2 weights, and event adjustment creates 4 weights, for a multiplicative total of 7 × 2 × 4 = 56 weights if we use all three adjustments. I've contemplated using this form previously, but it does take a small correlational hit when compared to the previous formula.
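A rough sketch of how the 56 cells could be laid out in Python follows. The event type labels are my placeholders, and treating each (score, venue, event) cell exactly like a score state when computing its weights is an assumption made for the example, not a description of the site's code. The history counts are dummy values.

```python
from itertools import product

SCORE_STATES = (-3, -2, -1, 0, 1, 2, 3)          # 7 score states
VENUES = ("home", "away")                         # 2 venues
EVENT_TYPES = ("goal", "save", "miss", "block")   # 4 event types (assumed labels)

def build_weights(history):
    """Map each (score, venue, event) cell to its (W_F, W_A) pair.

    `history` holds past-season (for, against) counts per cell.
    """
    weights = {}
    for cell in product(SCORE_STATES, VENUES, EVENT_TYPES):
        f_avg, a_avg = history[cell]
        total = f_avg + a_avg
        weights[cell] = (total / f_avg, total / a_avg)
    return weights

# Dummy history: every cell gets the same invented counts.
history = {cell: (550, 450)
           for cell in product(SCORE_STATES, VENUES, EVENT_TYPES)}
weights = build_weights(history)
print(len(weights))  # one weight pair per cell
```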

Generally speaking, the calculated versions of these metrics take a little longer to mature; before about 15 games played per team (or 225 games league wide), the preweighted versions outperform the calculated ones. After that, however, there's a slight benefit to using the calculated form. My guess on that behavior is this: before 15 games, not only has the data not converged for individual teams, but it hasn't converged for the league as a whole. After that point, preweighting introduces a slight lopsidedness to the formula as it slightly overrepresents teams that shoot more and underrepresents teams that shoot less, which introduces some noise to the weights.

The difference between these two methods is so small - an average of 0.18 percentage points in correlational $$R^2$$ - that it can effectively be ignored. I'm not particularly interested in arguing which method is better, as they both perform similarly. Whether we apply these adjustments at all matters far more than the specifics of how they are applied. Preweighting isn't new either - I think most other sites that use adjusted numbers use the preweighted method. That being said, I've modified the site to employ the preweighted formulas for any dataset that includes fewer than 225 games total, or about 15 games per team. This decision is more aesthetic than anything, as the preweighted formulas tend to vary a little less over smaller data sets.

What preweighting does allow me is a little more confidence in posting playoff numbers. Historically I've considered the playoffs too small a data set to adjust. Calculated Corsi doesn't reach its most predictive state until after around 225 games, and the playoffs can never reach that number: even if every series went to a game 7, there would only be 105 games.
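The game counts behind that rule are easy to check. The short Python sketch below uses a hypothetical `choose_method` helper to show the 225-game cutoff; the 30-team league and the 15 best-of-seven series are inferred from the numbers in this post rather than stated in it.

```python
# Cutoff from the post: under 225 games total, use the preweighted formulas.
PREWEIGHT_THRESHOLD_GAMES = 225

def choose_method(total_games):
    """Pick which form to use for a dataset of the given size."""
    if total_games < PREWEIGHT_THRESHOLD_GAMES:
        return "preweighted"
    return "calculated"

TEAMS = 30            # assumption: implied by 15 games per team ~ 225 games
PLAYOFF_SERIES = 15   # assumption: implied by 105 games at 7 games per series

# Each game involves two teams, so 15 games per team is half of 30 * 15.
games_at_15_each = TEAMS * 15 // 2

# Upper bound on playoff games if every series goes the full seven.
max_playoff_games = PLAYOFF_SERIES * 7

print(games_at_15_each, max_playoff_games, choose_method(max_playoff_games))
```

Because the playoff ceiling sits well under the cutoff, any playoff-only dataset always falls on the preweighted side.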

The site now has an additional filter that lets you choose regular season games, playoff games, or both. Note that if you choose any option that includes playoff games, the "Points" column will be replaced by a Win-Loss-OTL "Record" column, as points are not awarded in the playoffs. I personally haven't looked much at the playoff numbers, but they've been requested by quite a few of my viewers, so I'm hoping that making them available will allow others to find some insight.