7 Part II - Tennis2017

7.1 Introduction

7.1.1 Topic

Tennis is one of my three favorite sports, along with soccer and basketball. I have been following tennis since 2010 and have witnessed many great matches over the years. I stopped watching for a while after my third year, mainly because my favorite player, Roger Federer, was dealing with an injury. I started paying attention again last year (2017), when Federer made an incredible comeback and played very well. For those reasons, I chose a dataset about men’s tennis in 2017 to analyze.

7.1.2 Data

I went looking for data and was fortunate to find a dataset of 2017 tennis matches available online: https://github.com/JeffSackmann/tennis_atp/blob/master/atp_matches_2017.csv . I copied its contents into a text file (.txt).

This dataset consists of 2886 observational units, each of which is one tennis match played in 2017. It has a total of 49 variables, which can be divided into the following groups:

  • Tournament: tourney_id, tourney_name, surface, draw_size, tourney_level, tourney_date
  • Match: match_num, best_of (3 or 5), round, minutes
  • Player (winner/loser) information: id, seed, entry, name, hand (L/R), ht (height - in cm), ioc (country code), age, rank, rank_points
  • Player (winner/loser) stats: ace, df (double faults), svpt (service points), 1stIn, 1stWon, 2ndWon (1st/2nd = first/second serve), svGms (service games), bpSaved, bpFaced (bp = break points)

7.2 Analysis

As usual, the following packages must be loaded before any analysis.

library(tidyverse)
library(knitr)
library(mosaic)

The data file was read in from my personal folder.

ATP2017 <- read.csv("~/Data229/Project/Tennis2017/ATP2017.txt")

7.2.1 Wins and Losses

First, let’s do some simple data transformation to find out who had the most W’s and L’s last year.

kable(ATP2017 %>% 
  group_by(winner_name) %>% 
  summarise(Wins = n()) %>% 
  arrange(desc(Wins)) %>% 
  head(8))
winner_name Wins
Rafael Nadal 67
David Goffin 59
Alexander Zverev 55
Roger Federer 53
Grigor Dimitrov 49
Roberto Bautista Agut 48
Dominic Thiem 47
Marin Cilic 45

Rafael Nadal, who ended 2017 as the number 1 ranked singles player, led the way with 67 wins. Four players had 50 or more wins last year: Nadal, D. Goffin, A. Zverev and R. Federer.

kable(ATP2017 %>% 
  group_by(loser_name) %>% 
  summarise(losses = n()) %>% 
  arrange(desc(losses)) %>% 
  head(8))
loser_name losses
Paolo Lorenzi 35
Joao Sousa 32
Mischa Zverev 32
Albert Ramos 31
Benoit Paire 31
Kyle Edmund 30
Robin Haase 30
Jan Lennard Struff 29

Paolo Lorenzi had the most losses in 2017, with 35. Seven players lost 30 or more matches last year.
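
The two tallies above follow the same group-and-count pattern, which dplyr also offers as a single `count()` call. Here is a minimal sketch, assuming a made-up `toy` data frame with the same `winner_name` and `loser_name` columns (the player names are placeholders, not real results):

```r
library(dplyr)

# Hypothetical four-match data frame; names are placeholders, not real results.
toy <- data.frame(
  winner_name = c("Player A", "Player A", "Player B", "Player A"),
  loser_name  = c("Player B", "Player C", "Player C", "Player C"),
  stringsAsFactors = FALSE
)

# count() groups by the column and tallies rows; sort = TRUE orders by the tally, descending.
wins   <- toy %>% count(winner_name, name = "Wins", sort = TRUE)
losses <- toy %>% count(loser_name, name = "losses", sort = TRUE)

print(wins)
print(losses)
```

On this toy input, Player A tops the wins table with 3 and Player C tops the losses table with 3.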

7.2.2 Match Duration

Next, I wanted to analyze the length of 2017 matches. I divided this into two separate parts: Best of 3 and Best of 5.

7.2.2.1 Best of 3 Matches

Best of 3 matches are matches on the ATP (A) and Masters (M) levels. Below are the numerical and visual summaries of the duration (in minutes) of BO3 matches:

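# %in% checks each row against both levels; == would recycle c("A", "M")
# and keep only about half of these matches (as in the summary printed below).
Bestof3 <- ATP2017 %>% 
  filter(tourney_level %in% c("A", "M"))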
kable(favstats(~ minutes, data = Bestof3))
min Q1 median Q3 max mean sd n missing
0 73 92 120 192 97.18857 32.01734 1050 14
Bestof3 %>%
  filter(!is.na(minutes)) %>% 
  ggplot(mapping = aes(minutes)) +
  geom_histogram(bins = 25, color = "black", fill = "tan")

The distribution of BO3 match durations looks roughly normal. The mean duration is about 97 minutes (1 hour and 37 minutes). The median is 92 minutes, which means half of the matches lasted longer than 1 hour and 32 minutes and the other half were shorter. The minimum time is 0, which most likely means there was a withdrawal before the match started, whereas the maximum is 192 minutes, meaning the longest Best of 3 match in this 2017 summary lasted 3 hours and 12 minutes.
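
A filter on several tournament levels depends on how the comparison treats a vector on its right-hand side. This minimal sketch uses a made-up vector of level codes (not rows from the dataset) to show how `==` and `%in%` differ:

```r
# Hypothetical level codes for six matches, not taken from the dataset.
level <- c("A", "A", "M", "M", "G", "A")

# == recycles c("A", "M") along the vector: position 1 is compared with "A",
# position 2 with "M", position 3 with "A" again, and so on.
eq_match <- level == c("A", "M")

# %in% asks, for every element, whether it equals any of the listed codes.
in_match <- level %in% c("A", "M")

print(eq_match)
print(in_match)
```

Here `==` keeps only 2 of the 5 entries coded "A" or "M", while `%in%` keeps all 5.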

7.2.2.2 Best of 5 Matches

Best of 5 matches are Grand Slam (G) and Davis Cup (D) matches. Below are the numerical and visual summaries of the duration (in minutes) of BO5 matches:

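# %in% checks each row against both levels; == would recycle c("G", "D")
# and keep only about half of these matches (as in the summary printed below).
Bestof5 <- ATP2017 %>% 
  filter(tourney_level %in% c("G", "D"))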
kable(favstats(~ minutes, data = Bestof5))
min Q1 median Q3 max mean sd n missing
28 109 134 168.25 296 141.4973 49.01037 368 13
Bestof5 %>%
  filter(!is.na(minutes)) %>% 
  ggplot(mapping = aes(minutes)) +
  geom_histogram(bins = 25, color = "black", fill = "orange")

The distribution of BO5 match durations is also roughly normal in shape (and closer to normal than the BO3 distribution). The average match duration is about 141 minutes (2 hours and 21 minutes). The median is 134 minutes, which means half of the matches lasted longer than 2 hours and 14 minutes and the other half were shorter. The minimum time is 28 minutes, which means the match did start and was probably cut short, perhaps by an injury around the 28-minute mark. The longest match lasted 296 minutes (4 hours and 56 minutes), almost 5 hours.
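
The hour-and-minute figures quoted for these summaries are plain integer division by 60. Here is a minimal sketch, where `to_hm` is a helper name made up for this illustration:

```r
# Hypothetical helper: convert minutes to an "H h M min" label using integer
# division (%/%) for the hours and the remainder (%%) for the minutes.
to_hm <- function(minutes) {
  sprintf("%d h %d min", as.integer(minutes %/% 60), as.integer(minutes %% 60))
}

# Mean (rounded), median and maximum from the BO5 summary above.
bo5 <- c(mean = 141, median = 134, max = 296)
labels <- to_hm(bo5)
print(labels)
```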

7.2.3 Number of Aces vs Height

My theory about tennis is that in order to win, you have to serve well. One of the main indicators of how well you serve is the number of aces you hit in a match, which is why the ace is my favorite tennis stat. I also believe that taller players, because of their height advantage, tend to serve better and hit more aces than shorter players. So I want to find out whether there is any correlation between aces and height.

My goal is a table in narrow format with two variables, “ace” and “ht”, where each row pairs one player’s aces in a match with that same player’s height. So I took “w_ace” with “winner_ht” and “l_ace” with “loser_ht” from the original table, stacked the two pairs into one tall table, and filtered out missing values.

# Stack winners and losers so each ace count stays with its own player's height.
# Two separate gather() calls would instead cross every ace count with both
# players' heights; the results reported below come from that crossed version.
AcevsHt <- bind_rows(
  ATP2017 %>% transmute(ace = w_ace, ht = winner_ht),
  ATP2017 %>% transmute(ace = l_ace, ht = loser_ht)
) %>% 
  filter(!is.na(ace), !is.na(ht))

Now let’s take a quick look at the table that was just created.

kable(AcevsHt %>% head())
ace ht
7 188
4 188
1 178
23 196
3 188
3 178
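
For the height comparison to make sense, each row of this table should pair a player's aces with that same player's height. This is a minimal sketch of one way to build such a table, assuming a made-up two-match data frame `toy` (the values are illustrative, not from the dataset):

```r
library(dplyr)

# Hypothetical two-match data frame using the dataset's column names.
toy <- data.frame(
  w_ace = c(7, 23), winner_ht = c(188, 196),
  l_ace = c(4, 1),  loser_ht  = c(183, 178)
)

# Rename the winner pair and the loser pair to (ace, ht), then stack them.
paired <- bind_rows(
  toy %>% transmute(ace = w_ace, ht = winner_ht),
  toy %>% transmute(ace = l_ace, ht = loser_ht)
)

print(paired)
```

Two matches give four rows, one per player. Gathering the ace columns and the height columns in two separate steps would give eight rows, because every ace count would be combined with both heights.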

Below is the ht vs ace plot:

AcevsHt %>% 
  ggplot(mapping = aes(x = ace , y = ht)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, size = 2)

lm(ht ~ ace, data = AcevsHt)

Call:
lm(formula = ht ~ ace, data = AcevsHt)

Coefficients:
(Intercept)          ace  
   184.2134       0.3199  

The plot above reveals a positive but weak relationship between aces and height, so the overall trend is that taller players tend to hit more aces. The regression equation is ht^ = 184.21 + 0.32*ace. The slope of 0.32 means that each additional ace in a match is associated with an increase of about 0.32 cm in predicted height; the intercept of 184.21 cm is the predicted height of a player who hits no aces.
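
To make the fitted line concrete, the coefficients printed in the lm() output can be plugged in by hand. Here is a minimal sketch, where `predict_ht` is a helper name made up for this illustration:

```r
# Coefficients copied from the lm() output above.
intercept <- 184.2134
slope     <- 0.3199

# Hypothetical helper: predicted height (cm) for a given number of aces.
predict_ht <- function(ace) intercept + slope * ace

pred <- predict_ht(c(0, 10, 20))
print(pred)
```

Under this fit, a difference of 10 aces corresponds to about 3.2 cm of predicted height.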

7.2.4 Roger Federer

As I mentioned in my introduction, my favorite tennis player is Roger Federer, so I want to look a little more closely at his successful 2017 season.

7.2.4.1 Head to Head

First, let’s find out who got beaten by Federer the most in 2017.

kable(ATP2017 %>% 
  filter(winner_name == "Roger Federer") %>% 
  group_by(loser_name) %>% 
  summarise(Total_Wins = n()) %>% 
  filter(Total_Wins > 1) %>% 
  arrange(desc(Total_Wins)) %>% 
  head())
loser_name Total_Wins
Rafael Nadal 4
Francis Tiafoe 3
Juan Martin Del Potro 3
Mischa Zverev 3
Tomas Berdych 3
Alexander Zverev 2

Surprisingly, Federer defeated his main rival Rafael Nadal 4 times, more than any other opponent in 2017. He also beat four players 3 times each (Tiafoe, Del Potro, Mischa Zverev and Berdych).

7.2.4.2 Surface Performance

Next, let’s look at Federer’s surface performance. Here are the numerical and visual summaries:

kable(ATP2017 %>% 
  filter(winner_name == "Roger Federer") %>% 
  group_by(surface) %>% 
  summarise(Matches = n()))
surface Matches
Grass 12
Hard 41
ATP2017 %>% 
  filter(winner_name == "Roger Federer") %>% 
  group_by(surface) %>% 
  summarise(Matches = n()) %>% 
  ggplot(mapping = aes(x = "", y = Matches, fill = surface)) +
  geom_bar(stat = "identity") +
  coord_polar("y", start = 0) + scale_fill_brewer(palette ="Accent") +  theme_minimal() +
  xlab("") + ylab("")

In 2017, Federer won 41 matches on hard courts and 12 matches on grass courts, his two favorite surfaces. Note also that Federer didn’t have any wins on clay, because he skipped the entire clay-court season.
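
The shares behind the pie chart can be computed directly from the two counts. Here is a minimal sketch using base R's `prop.table()`:

```r
# Win counts by surface, copied from the table above.
wins <- c(Grass = 12, Hard = 41)

# prop.table() divides each count by the total; scale to percentages.
share <- round(100 * prop.table(wins), 1)
print(share)
```

About 77% of the wins came on hard courts and about 23% on grass.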

7.2.4.3 Aces

As I mentioned in the previous part, my favorite tennis stat is number of aces. Roger Federer is a great server and is number 2 on the career aces list. So I wanted to find out about Federer’s aces in 2017.

The first thing I did was create a table in narrow format with only two variables, ace and Player, and then keep only the rows for Roger Federer.

# Stack winners and losers so each ace count stays with the player who hit it.
# Two separate gather() calls would also attach opponents' aces to Federer's
# name; the summary reported below comes from that crossed version.
FedAce <- bind_rows(
  ATP2017 %>% transmute(ace = w_ace, Player = winner_name),
  ATP2017 %>% transmute(ace = l_ace, Player = loser_name)
) %>% 
  filter(Player == "Roger Federer")
kable(FedAce %>% head())
ace Player
19 Roger Federer
17 Roger Federer
8 Roger Federer
24 Roger Federer
9 Roger Federer
11 Roger Federer

Now that I have my table, it’s time to check out the distribution of Federer’s 2017 aces.

FedAce %>% 
  ggplot(mapping = aes(ace)) +
  geom_histogram(bins = 25, color = "black", fill = "grey")

kable(favstats(~ ace, data = FedAce))
min Q1 median Q3 max mean sd n missing
0 4 6.5 11 24 7.596491 4.807294 114 2

The distribution of Federer’s 2017 aces is skewed to the right. His average number of aces last year was 7.6, so about 8 aces per match. His highest number of aces in one match was 24. There were some matches in which Federer did not hit a single ace, which is why the minimum is 0.
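
When summarising one player's aces, each match should contribute only the count that player hit himself, whether he won or lost. This is a minimal sketch of that selection, assuming a made-up three-match data frame `toy` (opponent names and ace counts are placeholders, not real results):

```r
library(dplyr)

# Hypothetical three-match data frame using the dataset's column names.
toy <- data.frame(
  winner_name = c("Roger Federer", "Player B", "Roger Federer"),
  loser_name  = c("Player A", "Roger Federer", "Player C"),
  w_ace = c(10, 5, 8),
  l_ace = c(3, 12, 2),
  stringsAsFactors = FALSE
)

player <- "Roger Federer"

# Keep the player's matches, then take w_ace when he won and l_ace when he lost.
own <- toy %>%
  filter(winner_name == player | loser_name == player) %>%
  mutate(ace = if_else(winner_name == player, w_ace, l_ace)) %>%
  select(ace)

print(own)
```

Each of the three matches contributes exactly one row, so the number of rows equals the number of matches played.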