Catching Kareem

Lighting round of basketball analysis!

My friend and coworker Brad, who designed this very blog, is a sports fan and curious person. He wanted to know whether Lebron James is on track to overtake NBA all-time high scorer Kareem Abdul-Jabbar (38,387 career points!) in average number of points scored per game. He threw Kevin Durant in as a third point of comparison.

So our question is: who’s on track to unseat Kareem?

library(here)
library(tidyverse)
library(rvest)  

We’ll use Basketball Reference which apparently is an authority on these sorts of things.

There’s no way to download data as far as I could tell, so a quick check that we can scrape some of their tables.

robotstxt::paths_allowed("https://www.basketball-reference.com/")
## 
 www.basketball-reference.com
## [1] TRUE

Getting Data

Each player on Basketball Reference here has their own page.

So for our three guys let’s stick their urls and names in two few character vectors.

kareem <- "https://www.basketball-reference.com/players/a/abdulka01.html"
lebron <- "https://www.basketball-reference.com/players/j/jamesle01.html"
durant <- "https://www.basketball-reference.com/players/d/duranke01.html"

player_urls <- c(kareem, lebron, durant)
player_names <- c("Kareem", "Lebron", "Durant")

Now we can scrape each url using rvest and the HTML tag table and throw it in a named list, giving it the name nsm[i], i.e. the player’s name we defined in player_names.

get_tables <- function(nms = player_names,
                       urls = player_urls) {
  out <- NULL
  
  for (i in seq_along(urls)) {
    this <- 
      read_html(urls[i]) %>% 
      html_nodes("table") %>% 
      html_table() 
    
    names(this) <- nms[i]
    
    out <- out %>% append(this)
  }
  return(out) 
}

Let’s save that in an all_stats list

all_stats <- get_tables()
glimpse(all_stats)
## List of 3
##  $ Kareem:'data.frame':  24 obs. of  30 variables:
##   ..$ Season: chr [1:24] "1969-70" "1970-71" "1971-72" "1972-73" ...
##   ..$ Age   : int [1:24] 22 23 24 25 26 27 28 29 30 31 ...
##   ..$ Tm    : chr [1:24] "MIL" "MIL" "MIL" "MIL" ...
##   ..$ Lg    : chr [1:24] "NBA" "NBA" "NBA" "NBA" ...
##   ..$ Pos   : chr [1:24] "C" "C" "C" "C" ...
##   ..$ G     : int [1:24] 82 82 81 76 81 65 82 82 62 80 ...
##   ..$ GS    : int [1:24] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ MP    : num [1:24] 43.1 40.1 44.2 42.8 43.8 42.3 41.2 36.8 36.5 39.5 ...
##   ..$ FG    : num [1:24] 11.4 13 14.3 12.9 11.7 12.5 11.1 10.8 10.7 9.7 ...
##   ..$ FGA   : num [1:24] 22.1 22.5 24.9 23.3 21.7 24.4 21.1 18.7 19.4 16.8 ...
##   ..$ FG%   : num [1:24] 0.518 0.577 0.574 0.554 0.539 0.513 0.529 0.579 0.55 0.577 ...
##   ..$ 3P    : num [1:24] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ 3PA   : num [1:24] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ 3P%   : num [1:24] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ 2P    : num [1:24] 11.4 13 14.3 12.9 11.7 12.5 11.1 10.8 10.7 9.7 ...
##   ..$ 2PA   : num [1:24] 22.1 22.5 24.9 23.3 21.7 24.4 21.1 18.7 19.4 16.8 ...
##   ..$ 2P%   : num [1:24] 0.518 0.577 0.574 0.554 0.539 0.513 0.529 0.579 0.55 0.577 ...
##   ..$ eFG%  : num [1:24] 0.518 0.577 0.574 0.554 0.539 0.513 0.529 0.579 0.55 0.577 ...
##   ..$ FT    : num [1:24] 5.9 5.7 6.2 4.3 3.6 5 5.5 4.6 4.4 4.4 ...
##   ..$ FTA   : num [1:24] 9.1 8.3 9 6.1 5.2 6.6 7.8 6.5 5.6 5.9 ...
##   ..$ FT%   : num [1:24] 0.653 0.69 0.689 0.713 0.702 0.763 0.703 0.701 0.783 0.736 ...
##   ..$ ORB   : num [1:24] NA NA NA NA 3.5 3 3.3 3.2 3 2.6 ...
##   ..$ DRB   : num [1:24] NA NA NA NA 11 11 13.5 10 9.9 10.2 ...
##   ..$ TRB   : num [1:24] 14.5 16 16.6 16.1 14.5 14 16.9 13.3 12.9 12.8 ...
##   ..$ AST   : num [1:24] 4.1 3.3 4.6 5 4.8 4.1 5 3.9 4.3 5.4 ...
##   ..$ STL   : num [1:24] NA NA NA NA 1.4 1 1.5 1.2 1.7 1 ...
##   ..$ BLK   : num [1:24] NA NA NA NA 3.5 3.3 4.1 3.2 3 4 ...
##   ..$ TOV   : num [1:24] NA NA NA NA NA NA NA NA 3.4 3.5 ...
##   ..$ PF    : num [1:24] 3.5 3.2 2.9 2.7 2.9 3.2 3.6 3.2 2.9 2.9 ...
##   ..$ PTS   : num [1:24] 28.8 31.7 34.8 30.2 27 30 27.7 26.2 25.8 23.8 ...
##  $ Lebron:'data.frame':  21 obs. of  30 variables:
##   ..$ Season: chr [1:21] "2003-04" "2004-05" "2005-06" "2006-07" ...
##   ..$ Age   : int [1:21] 19 20 21 22 23 24 25 26 27 28 ...
##   ..$ Tm    : chr [1:21] "CLE" "CLE" "CLE" "CLE" ...
##   ..$ Lg    : chr [1:21] "NBA" "NBA" "NBA" "NBA" ...
##   ..$ Pos   : chr [1:21] "SG" "SF" "SF" "SF" ...
##   ..$ G     : int [1:21] 79 80 79 78 75 81 76 79 62 76 ...
##   ..$ GS    : int [1:21] 79 80 79 78 74 81 76 79 62 76 ...
##   ..$ MP    : num [1:21] 39.5 42.4 42.5 40.9 40.4 37.7 39 38.8 37.5 37.9 ...
##   ..$ FG    : num [1:21] 7.9 9.9 11.1 9.9 10.6 9.7 10.1 9.6 10 10.1 ...
##   ..$ FGA   : num [1:21] 18.9 21.1 23.1 20.8 21.9 19.9 20.1 18.8 18.9 17.8 ...
##   ..$ FG%   : num [1:21] 0.417 0.472 0.48 0.476 0.484 0.489 0.503 0.51 0.531 0.565 ...
##   ..$ 3P    : num [1:21] 0.8 1.4 1.6 1.3 1.5 1.6 1.7 1.2 0.9 1.4 ...
##   ..$ 3PA   : num [1:21] 2.7 3.9 4.8 4 4.8 4.7 5.1 3.5 2.4 3.3 ...
##   ..$ 3P%   : num [1:21] 0.29 0.351 0.335 0.319 0.315 0.344 0.333 0.33 0.362 0.406 ...
##   ..$ 2P    : num [1:21] 7.1 8.6 9.5 8.6 9.1 8.1 8.4 8.4 9.1 8.7 ...
##   ..$ 2PA   : num [1:21] 16.1 17.2 18.3 16.8 17.1 15.2 15 15.3 16.5 14.5 ...
##   ..$ 2P%   : num [1:21] 0.438 0.499 0.518 0.513 0.531 0.535 0.56 0.552 0.556 0.602 ...
##   ..$ eFG%  : num [1:21] 0.438 0.504 0.515 0.507 0.518 0.53 0.545 0.541 0.554 0.603 ...
##   ..$ FT    : num [1:21] 4.4 6 7.6 6.3 7.3 7.3 7.8 6.4 6.2 5.3 ...
##   ..$ FTA   : num [1:21] 5.8 8 10.3 9 10.3 9.4 10.2 8.4 8.1 7 ...
##   ..$ FT%   : num [1:21] 0.754 0.75 0.738 0.698 0.712 0.78 0.767 0.759 0.771 0.753 ...
##   ..$ ORB   : num [1:21] 1.3 1.4 0.9 1.1 1.8 1.3 0.9 1 1.5 1.3 ...
##   ..$ DRB   : num [1:21] 4.2 6 6.1 5.7 6.1 6.3 6.4 6.5 6.4 6.8 ...
##   ..$ TRB   : num [1:21] 5.5 7.4 7 6.7 7.9 7.6 7.3 7.5 7.9 8 ...
##   ..$ AST   : num [1:21] 5.9 7.2 6.6 6 7.2 7.2 8.6 7 6.2 7.3 ...
##   ..$ STL   : num [1:21] 1.6 2.2 1.6 1.6 1.8 1.7 1.6 1.6 1.9 1.7 ...
##   ..$ BLK   : num [1:21] 0.7 0.7 0.8 0.7 1.1 1.1 1 0.6 0.8 0.9 ...
##   ..$ TOV   : num [1:21] 3.5 3.3 3.3 3.2 3.4 3 3.4 3.6 3.4 3 ...
##   ..$ PF    : num [1:21] 1.9 1.8 2.3 2.2 2.2 1.7 1.6 2.1 1.5 1.4 ...
##   ..$ PTS   : num [1:21] 20.9 27.2 31.4 27.3 30 28.4 29.7 26.7 27.1 26.8 ...
##  $ Durant:'data.frame':  16 obs. of  30 variables:
##   ..$ Season: chr [1:16] "2007-08" "2008-09" "2009-10" "2010-11" ...
##   ..$ Age   : int [1:16] 19 20 21 22 23 24 25 26 27 28 ...
##   ..$ Tm    : chr [1:16] "SEA" "OKC" "OKC" "OKC" ...
##   ..$ Lg    : chr [1:16] "NBA" "NBA" "NBA" "NBA" ...
##   ..$ Pos   : chr [1:16] "SG" "SF" "SF" "SF" ...
##   ..$ G     : int [1:16] 80 74 82 78 66 81 81 27 72 62 ...
##   ..$ GS    : int [1:16] 80 74 82 78 66 81 81 27 72 62 ...
##   ..$ MP    : num [1:16] 34.6 39 39.5 38.9 38.6 38.5 38.5 33.8 35.8 33.4 ...
##   ..$ FG    : num [1:16] 7.3 8.9 9.7 9.1 9.7 9 10.5 8.8 9.7 8.9 ...
##   ..$ FGA   : num [1:16] 17.1 18.8 20.3 19.7 19.7 17.7 20.8 17.3 19.2 16.5 ...
##   ..$ FG%   : num [1:16] 0.43 0.476 0.476 0.462 0.496 0.51 0.503 0.51 0.505 0.537 ...
##   ..$ 3P    : num [1:16] 0.7 1.3 1.6 1.9 2 1.7 2.4 2.4 2.6 1.9 ...
##   ..$ 3PA   : num [1:16] 2.6 3.1 4.3 5.3 5.2 4.1 6.1 5.9 6.7 5 ...
##   ..$ 3P%   : num [1:16] 0.288 0.422 0.365 0.35 0.387 0.416 0.391 0.403 0.387 0.375 ...
##   ..$ 2P    : num [1:16] 6.6 7.6 8.1 7.3 7.7 7.3 8.1 6.4 7.1 7 ...
##   ..$ 2PA   : num [1:16] 14.5 15.7 16.1 14.4 14.4 13.6 14.8 11.4 12.5 11.5 ...
##   ..$ 2P%   : num [1:16] 0.455 0.486 0.506 0.504 0.535 0.539 0.549 0.565 0.569 0.608 ...
##   ..$ eFG%  : num [1:16] 0.451 0.51 0.514 0.509 0.547 0.559 0.56 0.578 0.573 0.594 ...
##   ..$ FT    : num [1:16] 4.9 6.1 9.2 7.6 6.5 8.4 8.7 5.4 6.2 5.4 ...
##   ..$ FTA   : num [1:16] 5.6 7.1 10.2 8.7 7.6 9.3 9.9 6.3 6.9 6.2 ...
##   ..$ FT%   : num [1:16] 0.873 0.863 0.9 0.88 0.86 0.905 0.873 0.854 0.898 0.875 ...
##   ..$ ORB   : num [1:16] 0.9 1 1.3 0.7 0.6 0.6 0.7 0.6 0.6 0.6 ...
##   ..$ DRB   : num [1:16] 3.5 5.5 6.3 6.1 7.4 7.3 6.7 6 7.6 7.6 ...
##   ..$ TRB   : num [1:16] 4.4 6.5 7.6 6.8 8 7.9 7.4 6.6 8.2 8.3 ...
##   ..$ AST   : num [1:16] 2.4 2.8 2.8 2.7 3.5 4.6 5.5 4.1 5 4.8 ...
##   ..$ STL   : num [1:16] 1 1.3 1.4 1.1 1.3 1.4 1.3 0.9 1 1.1 ...
##   ..$ BLK   : num [1:16] 0.9 0.7 1 1 1.2 1.3 0.7 0.9 1.2 1.6 ...
##   ..$ TOV   : num [1:16] 2.9 3 3.3 2.8 3.8 3.5 3.5 2.7 3.5 2.2 ...
##   ..$ PF    : num [1:16] 1.5 1.8 2.1 2 2 1.8 2.1 1.5 1.9 1.9 ...
##   ..$ PTS   : num [1:16] 20.3 25.3 30.1 27.7 28 28.1 32 25.4 28.2 25.1 ...

Cleaning Data

Next we’ll want to tidy this list into a dataframe, adding an identifier column for player name.

tidy_stats <- function(lst) {
  out <- NULL
    
  for (i in seq_along(lst)) {
    this <- lst[[i]] %>% 
      mutate(player = names(lst[i]))
    
    out <- out %>% bind_rows(this)
  }
  out <- out %>% as_tibble()
  return(out)
}
(tidied <- tidy_stats(all_stats))
## # A tibble: 61 x 31
##    Season   Age Tm    Lg    Pos       G    GS    MP    FG   FGA `FG%`  `3P`
##    <chr>  <int> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 1969-…    22 MIL   NBA   C        82    NA  43.1  11.4  22.1 0.518    NA
##  2 1970-…    23 MIL   NBA   C        82    NA  40.1  13    22.5 0.577    NA
##  3 1971-…    24 MIL   NBA   C        81    NA  44.2  14.3  24.9 0.574    NA
##  4 1972-…    25 MIL   NBA   C        76    NA  42.8  12.9  23.3 0.554    NA
##  5 1973-…    26 MIL   NBA   C        81    NA  43.8  11.7  21.7 0.539    NA
##  6 1974-…    27 MIL   NBA   C        65    NA  42.3  12.5  24.4 0.513    NA
##  7 1975-…    28 LAL   NBA   C        82    NA  41.2  11.1  21.1 0.529    NA
##  8 1976-…    29 LAL   NBA   C        82    NA  36.8  10.8  18.7 0.579    NA
##  9 1977-…    30 LAL   NBA   C        62    NA  36.5  10.7  19.4 0.55     NA
## 10 1978-…    31 LAL   NBA   C        80    NA  39.5   9.7  16.8 0.577    NA
## # … with 51 more rows, and 19 more variables: `3PA` <dbl>, `3P%` <dbl>,
## #   `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>,
## #   FTA <dbl>, `FT%` <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>,
## #   STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>, player <chr>

Now we have a dataframe, but it’s still a bit messy. We scraped the entire table which contains rows that aren’t actually seasons. Instead they’re summaries of multiple seasons, indicated with things like “14 Seasons”. Since these rows contain letters in the season column, they’re easy to identify and drop.

We also want to add a column indicating season number. This allows us to equate Kareem’s third season with Lebron’s third season with Durant’s third season. And of course, we need to get a running tally of total average points per game for each season.

tidied_clean <- 
  tidied %>% 
  filter(!(str_detect(Season, "[A-Za-z]") |
           Season == "")) %>% 
  group_by(player) %>% 
  rename(
    season_points = PTS
  ) %>% 
  mutate(
    season_num = row_number(),
    cumulative_points = cumsum(season_points)
  ) %>% 
  select(Season, player, season_num, season_points, cumulative_points) 

Now we’ve got our clean data!

tidied_clean %>% 
  knitr::kable()
Season player season_num season_points cumulative_points
1969-70 Kareem 1 28.8 28.8
1970-71 Kareem 2 31.7 60.5
1971-72 Kareem 3 34.8 95.3
1972-73 Kareem 4 30.2 125.5
1973-74 Kareem 5 27.0 152.5
1974-75 Kareem 6 30.0 182.5
1975-76 Kareem 7 27.7 210.2
1976-77 Kareem 8 26.2 236.4
1977-78 Kareem 9 25.8 262.2
1978-79 Kareem 10 23.8 286.0
1979-80 Kareem 11 24.8 310.8
1980-81 Kareem 12 26.2 337.0
1981-82 Kareem 13 23.9 360.9
1982-83 Kareem 14 21.8 382.7
1983-84 Kareem 15 21.5 404.2
1984-85 Kareem 16 22.0 426.2
1985-86 Kareem 17 23.4 449.6
1986-87 Kareem 18 17.5 467.1
1987-88 Kareem 19 14.6 481.7
1988-89 Kareem 20 10.1 491.8
2003-04 Lebron 1 20.9 20.9
2004-05 Lebron 2 27.2 48.1
2005-06 Lebron 3 31.4 79.5
2006-07 Lebron 4 27.3 106.8
2007-08 Lebron 5 30.0 136.8
2008-09 Lebron 6 28.4 165.2
2009-10 Lebron 7 29.7 194.9
2010-11 Lebron 8 26.7 221.6
2011-12 Lebron 9 27.1 248.7
2012-13 Lebron 10 26.8 275.5
2013-14 Lebron 11 27.1 302.6
2014-15 Lebron 12 25.3 327.9
2015-16 Lebron 13 25.3 353.2
2016-17 Lebron 14 26.4 379.6
2017-18 Lebron 15 27.5 407.1
2018-19 Lebron 16 27.4 434.5
2007-08 Durant 1 20.3 20.3
2008-09 Durant 2 25.3 45.6
2009-10 Durant 3 30.1 75.7
2010-11 Durant 4 27.7 103.4
2011-12 Durant 5 28.0 131.4
2012-13 Durant 6 28.1 159.5
2013-14 Durant 7 32.0 191.5
2014-15 Durant 8 25.4 216.9
2015-16 Durant 9 28.2 245.1
2016-17 Durant 10 25.1 270.2
2017-18 Durant 11 26.4 296.6
2018-19 Durant 12 26.8 323.4

Moment of Truth

Next it seems like we’d want to plot this to get a sense of who’s on track to unseat Kareem. I’ll use ggplot2 along with Bob Rudis’s hrbrthemes::theme_ipsum and Karthik Ram’s wesanderson::wes_palette.

pal <- wesanderson::wes_palette("Darjeeling1")

ggplot(tidied_clean) +
  geom_point(aes(x = season_num, y = cumulative_points, colour = player),
             stat = "identity") +
  labs(x = "Season Number", y = "Cumulative Points", colour = "Player") +
  ggtitle("Catching Kareem", subtitle = "Average points per game") +
  hrbrthemes::theme_ipsum() +
  scale_colour_manual(values = pal) 

In Lebron’s most recent season, it looks like he finally caught Kareem. Did he?

tidied_clean %>% 
  filter(season_num == 15) %>% 
  knitr::kable()
Season player season_num season_points cumulative_points
1983-84 Kareem 15 21.5 404.2
2017-18 Lebron 15 27.5 407.1

Going by average points scored per game, he’s already overtaken Kareem. Swish.