World Cup - Scoring Analysis

After looking at the host advantage and winner's curse in a previous article, I wanted to look at the full set of match history, along with the frequency and timing of goals in those matches.

library(ggplot2)
library(plotly)

Data

The data sets we'll use in this article are the following:

matches_cleaned.rds
goals_cleaned.rds

These data come from FIFA's World Cup Archive. The matches data set has the local start time of the match separated into year, month, hour, and minute, followed by the stage/round of the match, the home/away team, home/away score, score total, score difference, and a reference link back to the archive page containing information about the match.

Here's a preview of the data set, noting that home/away designations are from FIFA's data, and I'm not necessarily sure whether they're meaningful in a neutral-site World Cup match.

matches <- readRDS(
matches_cleaned.rds
)
matches
0 items

A note about the stage variable: since the earliest round of the World Cup has varied between a round-robin group format and an entirely single-elimination tournament, group stages are labeled "Group", and FIFA designations of "First round" and "Preliminary round" have been combined into the label "Round of 16", since they were effectively knockout games.

Similarly, here's a preview of the goals data set. In addition to the variable shown below, each record has the two teams in the match and the cumulative score as of the goal listed (in other words, score is never 0-0 in this data set - it would begin at 1-0 on the first goal in the match). There is also a flag for whether a goal broke a tie.

goals <- readRDS(
goals_cleaned.rds
)
goals[c('player', 'team', 'minute', 'stoppage_time', 'stoppage_minute', 'penalty', 'own_goal', 'extra_time', 'regulation_time', 'minute_rounded')]
0 items

Match-level scoring

To start, we can calculate the maximum total match score, since that will be useful across many of the plots to follow, and then plot a histogram of total scoring.

score_total_max <- max(matches$score_total)
bar_alpha <- 0.25
bar_color <- 'tomato'
# Distribution of total match scoring
ggplotly(ggplot(
  data = matches,
  mapping = aes(
    x = score_total)) +
  geom_bar(
    alpha = bar_alpha,
    color = bar_color,
    fill = bar_color) +
  scale_x_continuous(
    limits = c(-0.5, score_total_max),
    breaks = seq(0, score_total_max)) +
	labs(
  	x = 'Total match score',
  	y = 'Count') +
  theme_bw())
Loading viewer…

Scoring over time

How has this changed over time? Are matches in 2018 expected to produce the same action as in 1930?

# Distribution of goals scored over time
ggplotly(ggplot(
  data = matches,
  mapping = aes(
    x = year,
    y = score_total)) +
  geom_boxplot(
    aes(group = year),
  	alpha = bar_alpha,
    color = bar_color,
    fill = bar_color) +
  geom_smooth() +
	scale_x_continuous(
  	breaks = seq(1930, 2014, by = 8)) +
  scale_y_continuous(
    breaks = seq(0, score_total_max),
    limits = c(0, score_total_max)) +
	labs(
  	x = 'Year',
  	y = 'Total match score distribution') +
	theme_bw())
Loading viewer…

I had two initial reactions to this: first, there's clearly a trend of decreasing match scoring from the first five World Cups into the more recent era (though the median goals per game did jump from 2 to 3 in 2014). Second, were there really no 0-0 ties until 1958?

Sure enough, it's true, as the following table and plot show, and to make it worse, the past three World Cups have tied the record for goal-less games, with seven each.

matches$zero_goals <- matches$score_home == 0 & matches$score_away == 0
zero_goals_by_year <- aggregate(
  x = matches['zero_goals'],
  by = matches['year'],
  FUN = sum)
# Zero-goal matches over time
ggplotly(ggplot(
  data = zero_goals_by_year,
  mapping = aes(
    x = year,
    y = zero_goals)) +
  geom_col(
    alpha = bar_alpha,
    color = bar_color,
    fill = bar_color) +
  scale_x_continuous(
    breaks = seq(1934, 2014, by = 8)) +
	labs(
  	x = 'Year',
  	y = 'Zero-goal matches') +
  theme_bw())
Loading viewer…

Scoring by round

What about scoring by round? We should expect more blowouts in the early stages, where first-time qualifiers might meet a team like Brazil. We might also expect more conservative play in the knockout rounds, although that may be counterbalanced by the fact that only the best offensive teams are likely to be remaining as the tournament proceeds.

# Scoring by tournament round/stage
ggplotly(ggplot(
  data = matches,
  mapping = aes(
    x = stage,
    y = score_total)) +
  geom_boxplot(
  	alpha = bar_alpha,
    color = bar_color,
    fill  = bar_color) +
  scale_y_continuous(
    breaks = seq(0, score_total_max),
    limits = c(0, score_total_max)) +
	labs(
  	x = 'Tournament stage',
    y = 'Total match score distribution') +
  theme_bw())
Loading viewer…

Early-round blowouts are confirmed, though surprisingly the maximum total match score occurred in the quarterfinals.

It's also interesting that even though more blowouts happen in early rounds, the total score quartiles actually move up in the semifinals and final, and the median match score increase from 3 to 4 for the final.

To look in detail, let's show the matches with at least 9 goals:

matches_9_goals <- matches[matches$score_total >= 9, ]
sort_order <- rev(order(matches_9_goals$score_total))
matches_9_goals[sort_order, 
                c('year', 'stage', 'team_home', 'team_away', 'score_home', 'score_away')]
0 items

As discussed, eight of the ten matches were group or first-round matches, with the lone quarterfinal match a 7-5. Central and eastern European countries dominate this list - hold onto that thought.

Scoring by team

Are there traditional scoring powerhouse countries, and some that prefer to hold back?

If we look at the top 10 countries by goals per game:

# Lengthen data set to one observation per match-team
matches_home <- matches[c('year', 'month', 'team_home', 'score_home')]
colnames(matches_home) <- gsub('_home', '', colnames(matches_home))
matches_away <- matches[c('year', 'month', 'team_away', 'score_away')]
colnames(matches_away) <- gsub('_away', '', colnames(matches_away))
matches_teams <- rbind(matches_home,
                       matches_away)
# Calculate mean score by team
mean_score_by_team <- aggregate(
  x = matches_teams['score'],
  by = matches_teams['team'],
  FUN = mean,
  na.rm = TRUE)
mean_score_by_team$score <- round(mean_score_by_team$score, 3)
# Print top 10
sort_order <- rev(order(mean_score_by_team$score))
head(mean_score_by_team[sort_order, ], 10)
0 items

...we find that Hungary, a team that has never hosted and never won a World Cup, not qualified since 1986, and not finished in the top 10 since 1966, holds a nearly 0.6 goals-per-game lead over traditional powers Brazil, Germany, and the former West Germany ("Germany FR" in the FIFA data).

It's also fascinating that average scoring for West Germany and Germany differed by just 0.001 goals per game, across sample sizes of 62 and 44 matches, respectively. German precision, indeed.

Note that countries change name (and territory) over time for political reasons, and FIFA respects this in its records. So the former East and West Germany maintain separate labels prior to reunification, although it has little effect since East Germany only appeared in the 1974 World Cup (hosted by West Germany).

The spectrum of the scoreless

Five countries have never scored a World Cup goal, and all of them have played no more than three total matches:

sum_score_by_team <- aggregate(
  x = matches_teams['score'],
  by = matches_teams['team'],
  FUN = sum,
  na.rm = TRUE)
zero_score_teams <- sum_score_by_team[sum_score_by_team$score == 0, ]
matches_by_team <- aggregate(
  x = setNames(matches_teams['score'], 'matches'),
  by = matches_teams['team'],
  FUN = length)
merge(
	zero_score_teams,
  matches_by_team)
0 items

But I found it interesting that this set of five countries contains 1) the world's most populous nation, 2) four of the 14 largest countries by area (Canada is 2nd, China 3rd, Zaire 11th, and Dutch East Indies 14th)...and 3) a tiny two-island nation.

I suppose people and land area don't translate into any more quality play - it's almost like coaching, cultural interest, and player development actually matter!

Goal-level analysis

Let's go down a level and start to look at individual goals and timing.

Scoring by time

At the most basic level, when do goals occur during matches, restricting to regulation time?

# Histogram of goals in regulation time, in 5-minute intervals
ggplotly(ggplot(
  data = goals[goals$regulation_time, ],
  mapping = aes(
    x = minute)) +
  geom_histogram(
    alpha = bar_alpha,
    color = bar_color,
    fill = bar_color,
    breaks = seq(0, 90, by = 5)) +
  geom_vline(
    xintercept = 45,
    color = 'black',
    linetype = 'dashed') +
  scale_x_continuous(
    breaks = seq(0, 90, by = 15),
    limits = c(0, 90)) +
	labs(
  	x = 'Minute',
  	y = 'Count') +
  theme_bw())
Loading viewer…

Some notes here: the first five minutes of matches produce the fewest goals of any interval, slowly building to a first-half peak between the 16th and 20th minute.

Then, the final five intervals in the second half each produce more goals than any five-minute period of the first half. But are these meaningful goals? Are they runaway goals in first-round blowouts or tie-breakers in the knockout round? One way to check is to split this plot by round, which I've done below.

# Merge match year and stage onto goals data set by match_link
goals_matches <- merge(
  goals,
  matches[c('match_link', 'year', 'stage')],
  all.x = TRUE)
goals_matches$first_round <- ifelse(
  goals_matches$stage == 'Group',
  'Group',
  'Other rounds')
ggplotly(ggplot(
  data = goals_matches[goals_matches$regulation_time, ],
  mapping = aes(
    x = minute)) +
  geom_histogram(
    alpha = bar_alpha,
    color = bar_color,
    fill = bar_color,
    breaks = seq(0, 90, by = 5)) +
  geom_vline(
    xintercept = 45,
    color = 'black',
    linetype = 'dashed') +
  facet_grid(first_round ~ .) +
  scale_x_continuous(
    breaks = seq(0, 90, by = 15),
    limits = c(0, 90)) +
	labs(
  	x = 'Minute',
  	y = 'Count') +
  theme_bw())
Loading viewer…

I don't think there's much to talk about here - both of the plots show a gradual increase form the open of the match to minutes 16-20, then a dip to halftime, and another increase to the end of the game, although the relative frequency of goals in the last 5 minutes of regulation is slightly higher in the knockout rounds.

What about looking at a frequency of goals in minutes 1-65 versus 66-90, by round? Shouldn't we expect a little more urgency to score near the end of the game in the knockout round, all else held equal?

goals_matches$minute_66_on <- factor(ifelse(
  goals_matches$minute > 66,
  'Minute 66+',
  'Before minute 65'))
chisq_result <- chisq.test(
	x = goals_matches$first_round[goals_matches$regulation_time],
  y = goals_matches$minute_66_on[goals_matches$regulation_time])
table(goals_matches$stage[goals_matches$regulation_time],
  goals_matches$minute_66_on[goals_matches$regulation_time])
chisq_result <- chisq.test(
	x = goals_matches$first_round[goals_matches$regulation_time],
  y = goals_matches$minute_66_on[goals_matches$regulation_time])

So maybe there's no hard evidence that scoring behavior in the final 25 minutes is different for first-round games, but with the right data set, I would love to take this further and see if the surge in goals from minutes 61-65 to 66-70 is well-explained by substitutions.

Path of scoring

I was also interested in scoring paths across all the World Cup games. This is a visualization I'm still struggling to get right, so I'll show a few different versions.

First, there's a step-chart with each goal in the minute it occurred:

# Step chart for goal totals
ggplotly(ggplot(
  data = goals,
  mapping = aes(
    x = minute,
    y = score_difference,
    group = match_link)) +
  geom_step(
    alpha = 0.25,
    color = 'red',
    position = position_jitter(
      height = 0.25,
      width = 0)) +
  labs(
    x = 'Minute',
    y = 'Score difference (goals)') +
  theme_bw())
Loading viewer…

But that's a little too jumbly, so I tried rounding the minute to the nearest 5-minute interval:

ggplotly(ggplot(
  data = goals,
  mapping = aes(
    x = minute_rounded,
    y = score_difference,
    group = match_link)) +
  geom_step(
    alpha = 0.25,
    color = 'red',
    position = position_jitter(
      height = 0.25,
      width = 0.25)) +
  labs(
  	x = 'Minute',
    y = 'Score difference (goals)') +
  theme_bw())
Loading viewer…

But I still think there's some direction missing here - there's so much overlap that you can't tell how many games are moving up or down on the chart at each interval.

So I tried a dot plot next ordered by the occurrence of the first goal, with minutes on the x-axis and the rest of the goals on the same y-value for each match:

# All goals, matches ordered by minute of goals
sort_order <- do.call(order, goals[c('minute', 'score_total')])
goals_ordered <- goals[sort_order, ]
match_links <- unique(goals_ordered['match_link'])
match_links$match_id <- seq_len(nrow(match_links))
goals_ordered <- merge(
  goals_ordered,
  match_links)
ggplotly(ggplot(
  data = goals_ordered,
  mapping = aes(
    x = minute,
    y = match_id)) +
  geom_point(
  	size = 1) +
  labs(
  	x = 'Minute',
  	y = 'Match') +
  theme_bw())
Loading viewer…

This puts the matches in an interesting order by first goal (nearly half of all matches have a goal by minute 25, but only 3/4 of matches have a goal after 50 minutes), but the subsequent goals might as well be in random order.

So this is still a work in progress to find a pattern in scoring over the course of matches.

Scoring difference and odds of winning

I was also curious to see the proportion of teams winning with a 1-0 lead, 2-1 lead, 3-0 lead, etc. So I constructed a dataset with the score at each minute (up to 90) in every game, and calculated the win percentage for each score combination at each minute (limited to minutes where such a score exists - for example, there's no data for a 5-0 lead after 1 minute).

# All matches, all minutes, cumulative score
matches_minutes <- expand.grid(
  match_link = matches$match_link,
  minute = seq(1, 90, by = 1),
  stringsAsFactors = FALSE)
matches_minutes <- merge(
  matches_minutes,
  goals[c('match_link', 'minute', 'team_1', 'team_2', 'final_team_1', 'win_team_1', 'draw')],
  all.x = TRUE)
matches_minutes$team_1[is.na(matches_minutes$team_1)] <- FALSE
matches_minutes$team_2[is.na(matches_minutes$team_2)] <- FALSE
matches_minutes$score_team_1 <- ave(
  x = matches_minutes$team_1,
  matches_minutes$match_link,
  FUN = cumsum)
matches_minutes$score_team_2 <- ave(
  x = matches_minutes$team_2,
  matches_minutes$match_link,
  FUN = cumsum)
matches_minutes$win_team_1 <- ave(
  x = matches_minutes$win_team_1,
  matches_minutes$match_link,
  FUN = function(x) {
    if (any(!is.na(x))) x[!is.na(x)][1]
    else FALSE
  })
# Limit to scores 4-4 or lower, to avoid giving up too much plot room to low-N combinations
win_probability <- matches_minutes[
  matches_minutes$score_team_1 <= 4 &
  matches_minutes$score_team_2 <= 4, ]
win_probability <- unique(win_probability[c('match_link', 'minute', 'score_team_1', 'score_team_2', 'win_team_1')])
win_probability <- aggregate(
  x = win_probability['win_team_1'],
  by = win_probability[c('minute', 'score_team_1', 'score_team_2')],
  FUN = mean,
  na.rm = TRUE)
# Create a labels data set since faceting doesn't look great in this case.
labels <- unique(win_probability[c('score_team_1', 'score_team_2')])
labels$label <- paste0(labels$score_team_1, '-', labels$score_team_2)
g <- ggplot(
  data = win_probability,
  mapping = aes(
    x = minute,
    y = win_team_1)) +
  geom_line(
    color = 'tomato') +
  geom_text(
    data = labels,
    mapping = aes(
      label = label),
    x = 10,
    y = 0.1,
    size = 3,
    color = '#777777') +
  facet_grid(score_team_2 ~ score_team_1,
             as.table = FALSE) +
  scale_x_continuous(
    breaks = c(15, 45, 75)) +
  scale_y_continuous(
    breaks = c(0.25, 0.75)) +
	labs(
  	x = 'Minute',
  	y = 'Win probability') +      
  theme_bw() +
  # Removing facet label strip, using geom_text for labeling instead
  theme(
    strip.background = element_blank(),
    strip.text = element_blank())

A couple of interesting findings here: having a 1-0 lead gives at least 75% chance of winning the game, at all points during the game. I'm not surprised by that in the second half, but I am surprised a 1-0 lead after 15 minutes is still an 80% predictor of winning.

Similarly, a 2-0 lead is worth at least an 80% win probability, at every minute of the game.

No team has ever lost a World Cup game with a 4-0 lead.

But there must have been at least one game with a 3-0 lead after 15 minutes where the leading team lost. Let's see if we can find it in the data:

blown_3_0_index <- goals$score_team_1 == 3 & goals$score_team_2 == 0 & !goals$win_team_1
blown_3_0_matches <- goals$match_link[blown_3_0_index]
matches[matches$match_link %in% blown_3_0_matches, 
        c('year', 'team_home', 'team_away', 'score_home', 'score_away', 'stage')]
0 items

There were actually two, both in the group stage: comebacks by Colombia against the Soviet Union in 1962, and by Uruguay against Senegal in 2002.

Runtimes (1)