

La Liga predictions review as Villarreal live to regret slow start

A full review of this season’s La Liga predictions model with Charlie Tuley

FC Barcelona v Villarreal CF - Liga Santander
Emery’s men struggled in La Liga this season
Photo by Joan Valls/Urbanandsport /NurPhoto via Getty Images

When Mitchel first asked me to pick up his match prediction model and use it for the 2021/2022 LaLiga season, I had no idea what to expect.

I figured that it would be something that I would mess about with for a few weeks and then lose interest and gradually stop posting about it.

However, it was the (mostly negative) reception of the model early on from many of my Twitter followers that got me completely invested in it.

Though I fall somewhere near the middle of the spectrum on the data/eye test debate, I wanted to prove to the anti-statistics people that the numbers and data actually have a part to play in football analysis.

While the model has had its ups and downs this season, it certainly proved its merit over the course of the year.

Because a football match has three possible results (a win for either team, or a draw), it is much more difficult to predict than most American sports, which almost always end with one team beating the other.

Given that there were three options that the model could predict for each match, my big-picture goal was for the model to predict at least 33% of La Liga’s matches correctly.

As the graphic above shows, 33% was an extremely disrespectful estimate. The model's share of correct predictions dipped below 40% only once, and that was very early on: after Matchday 2, it had called seven of the first twenty matches (35%) correctly.

Needless to say, I was very happy with the overall performance of the model.

In the end, our model correctly guessed the result of 174 of the 380 matches this season (45.8%), an average of 4.58 correct predictions per ten-match matchday.

The betting company that we used as a comparison, DraftKings, ended up correctly predicting the results of 197 of the matches this season (51.8%), or 5.18 accurate guesses per matchday.
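For anyone who wants to check the arithmetic, here is a minimal sketch using only the totals quoted above. The helper name `accuracy_summary` is made up for this example, and the ten-matches-per-matchday figure assumes a full 20-team round.

```python
# Season totals quoted in the article.
TOTAL_MATCHES = 380
MATCHES_PER_MATCHDAY = 10  # assumption: 20 teams, so 10 fixtures per round


def accuracy_summary(correct, total=TOTAL_MATCHES, per_round=MATCHES_PER_MATCHDAY):
    """Return (share of correct picks, average correct picks per matchday)."""
    share = correct / total
    return share, share * per_round


model_share, model_per_round = accuracy_summary(174)
book_share, book_per_round = accuracy_summary(197)

print(f"Model:      {model_share:.1%}, {model_per_round:.2f} per matchday")
print(f"DraftKings: {book_share:.1%}, {book_per_round:.2f} per matchday")
print(f"Gap:        {(book_share - model_share) * 100:.1f} percentage points")
```

The gap works out to 23 matches over the season, or roughly six percentage points.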

Given that our model was 100% numbers-based and we did not have accurate Expected Goals data to start the season, I am not concerned with the gap between our model and the DraftKings setup.

Along with trying to accurately predict how each match would end, our model also tried to predict the exact score of each game.

As expected, this is much more difficult than just trying to predict which team will win (or draw), as there are many, many different score lines that a match can end with.

Mitchel’s model went with a matrix of possible scores from either team running from zero to 5+, so we had thirty-six different score lines available to us.
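As an illustration only, one common way to fill a grid like this is to treat each team's goals as an independent Poisson variable driven by its expected goals. That is an assumption made for this sketch, not a description of how Mitchel's model works internally; the function names and the xG inputs in the usage line are hypothetical.

```python
import math

MAX_GOALS = 5  # the last bucket means "5 or more"
SIZE = MAX_GOALS + 1  # six buckets per team, so 36 score lines


def goal_probs(expected_goals):
    """Poisson probabilities for 0..4 goals, plus a single '5+' bucket."""
    probs = [
        math.exp(-expected_goals) * expected_goals**k / math.factorial(k)
        for k in range(MAX_GOALS)
    ]
    probs.append(1.0 - sum(probs))  # everything left over is 5+
    return probs


def score_matrix(home_xg, away_xg):
    """6x6 grid: entry [h][a] is P(home scores h, away scores a)."""
    home, away = goal_probs(home_xg), goal_probs(away_xg)
    return [[ph * pa for pa in away] for ph in home]


def summarize(matrix):
    """Collapse the grid into home/draw/away probabilities and the likeliest score."""
    cells = [(h, a) for h in range(SIZE) for a in range(SIZE)]
    home = sum(matrix[h][a] for h, a in cells if h > a)
    # Simplification: the "5+ v 5+" cell is counted as a draw.
    draw = sum(matrix[h][h] for h in range(SIZE))
    away = 1.0 - home - draw
    best = max(cells, key=lambda s: matrix[s[0]][s[1]])
    return home, draw, away, best


# Hypothetical inputs: the home side is expected to create slightly more.
home, draw, away, best = summarize(score_matrix(1.6, 1.1))
```

Picking the single likeliest cell gives the exact-score prediction, while summing the cells above, on, and below the diagonal gives the win/draw/loss call.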

I had little to no expectations for this part of the model, and our end result of 47 correctly predicted score lines (12.4% overall) was very impressive in my eyes.

We also kept track of how many times our model correctly predicted the outcome of a match that the DraftKings bettors incorrectly predicted.

These occurrences were few and far between, as most of our predictions tended to be similar to those over at DraftKings. By the end of the season, we had beaten the bettors in 27 matches, or 7.11% of this season’s games.
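The "beat the bettors" count can be sketched as follows. The three short lists are invented toy data ("H" home win, "D" draw, "A" away win), and `beat_the_book` is a hypothetical helper, not something taken from the actual tracking spreadsheet.

```python
def beat_the_book(actual, model, bookmaker):
    """Count matches the model called correctly while the bookmaker's pick was wrong."""
    return sum(
        1
        for result, ours, theirs in zip(actual, model, bookmaker)
        if ours == result and theirs != result
    )


# Toy data for illustration only.
actual = ["H", "D", "A", "H", "A"]
model = ["H", "H", "A", "A", "A"]
bookmaker = ["H", "A", "H", "H", "A"]

wins = beat_the_book(actual, model, bookmaker)
print(wins)  # 1 (only the third match qualifies)

# The season figure quoted above: 27 such matches out of 380.
print(f"{27 / 380:.2%}")  # 7.11%
```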

I also wanted to see if the model found it easier to predict certain teams’ games over the course of the season.

As expected, La Liga’s “Big Three” (Real Madrid, Barcelona, and Atlético Madrid) were the teams whose results the model had the easiest time picking correctly, which makes sense given that these teams tend to create a lot of chances (high average xG) and restrict the number of chances their opponents create (low average xG against).

Funnily enough, Rayo Vallecano was also among the teams with the most correct predictions, which also makes sense when you understand the model on a functional level.

Our model tended to favor home teams (due to the slight xG advantage we gave home teams), and Rayo went unbeaten at home in the first half of the season.

This, coupled with Rayo losing just about every game where they played an opponent who should be beating them, really racked up the number of correct predictions we had for the Madrid club.

Digging deeper, I wanted to see if there was any relationship between finishing higher (or lower) in the La Liga standings and the number of correct predictions we had for a team.

I ran a quick regression on the data set, and I found the r-squared value to be 0.352. This means that 35.2% of the variance seen in the graph above regarding correct predictions is explained by the league position.
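For a regression with a single predictor, r-squared is simply the square of the correlation coefficient between the two columns. Here is a minimal sketch of that calculation; the five data points are made-up toy values rather than the real league table, and `r_squared` is a hypothetical helper.

```python
import math


def r_squared(xs, ys):
    """r-squared of a one-predictor linear fit (squared Pearson correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy**2 / (sxx * syy)


# Toy data: final league position vs. correct predictions (illustrative only).
positions = [1, 2, 3, 4, 5]
correct_picks = [20, 18, 19, 15, 16]
toy_r2 = r_squared(positions, correct_picks)

# The value reported above, expressed as a correlation strength.
reported_r = math.sqrt(0.352)
print(f"|r| = {reported_r:.2f}")  # |r| = 0.59
```

An r-squared of 0.352 therefore corresponds to a correlation of roughly 0.59 in absolute terms, leaving about 65% of the variance unexplained by league position.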

While there is some correlation between finishing higher in the table and garnering more correct predictions, it is not strong enough to conclude that league position on its own determines how predictable a team is.

Finally, given that this is a Villarreal-centric site, I thought I’d use the data I’ve accrued to generate at least one Villarreal-based visualization for this article.

Above we can see a rolling average of Villarreal’s non-penalty expected goal difference over the course of the season.

Immediately, you can see the biggest disappointment of Villarreal’s domestic campaign, the side’s slow start.

Despite having a healthy (and positive) rolling average npxG difference from the get-go, Villarreal were not able to capitalize on their chances early, drawing five of their first six matches.

However, the graph accurately details the Yellow Submarine’s hot streak near the middle of the season, along with the point where they began to tail off near the end of the season as they were trying to balance their Champions League run with their domestic fixtures.

On review

I’m not sure if I will continue running this model next season, but if I do, there are a few limitations that I’d like to clean up.

The first and by far the most pressing issue was the lack of proper data that we started the season with. Given that the model ran on a five-game rolling average of a team’s non-penalty xG difference, the first five matches of the season had insufficient data.

To remove this issue, the model would have to start from Matchday Five.
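The cold-start problem is easy to see in a small sketch of a five-game trailing average. The npxG difference values below are invented for illustration, and `rolling_mean` is a hypothetical helper rather than the model's actual code.

```python
def rolling_mean(values, window=5):
    """Trailing average; None until a full window of matches is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough matches played yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Made-up npxG difference for one team's first six matches.
npxg_diff = [0.5, -0.2, 1.0, 0.3, 0.4, -0.1]
averages = rolling_mean(npxg_diff)
print(averages[:4])  # [None, None, None, None]
```

The first real value only appears once the fifth match has been played. Whether that value feeds the prediction for the fifth matchday or the one after it is a design choice, and any earlier prediction has to lean on a shorter window or on data from the previous season.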

Another issue that I faced stemmed from how I used the DraftKings odds.

The odds-makers are constantly changing their odds in the lead-up to matchday, so the date and time at which I scraped the DraftKings website would make a difference in how teams were favoured.

Going forward, I would need to set up a consistent schedule of when exactly to get the data from the website, which would likely reduce any variance from the bettors’ side of things.

I really enjoyed messing around with this model over the course of this season, and though I probably won’t do it again next year, I would love to see someone else take up the reins on the project.

If anyone would be interested, reach out to me (@AnalyticsLaLiga) or Mitchel (@MSocAnalytics) on Twitter, and hopefully we can see a new and improved model for the 2022/2023 LaLiga season.