Creating the Tao of Tim

Outsourcing exposure to new ideas through web scraping and automation

Overview

The Tim Ferriss Show is a podcast produced by Tim Ferriss, which aims to deconstruct world-class performers and share their tools and tactics along the way.

The goal of this project is to programmatically extract the show links from each episode and create a job that runs each morning to serve up a random link. By doing this, I hope to automate my exposure to different ideas from top performers.

Finding The Data

library(dplyr)
library(httr2)
library(here)
library(stringr)
library(tidyr)
library(xml2)
library(telegram.bot)

Show links are available for each podcast guest here. Links to the individual podcast episodes are located within the tim-ferriss-podcast-list div tag.

The only problem is that only a handful of episodes are visible at a time. To expose more episodes, I need to click the load-more-podcasts button, which runs some JavaScript against a WordPress API to load additional podcasts from the tim_podcasts endpoint.

To work around this, I figured I would use some Python and the selenium library to programmatically click the load-more-podcasts button until it was disabled.

The selenium approach worked - to a degree - but there is JavaScript within Tim’s website that renders multiple call-to-action pages. I could spend more time finding a way around the JavaScript, but as I was thinking about how to do that, I stumbled across the sitemap.

Now this is what I’m looking for!

No need to create a headless browser and deal with all this javascript, I can instead get right to the links I’m after.

Gathering The URLs

Since the sitemap is in XML format, I first need to use the xml2 package to parse out the URL for each web page. Thankfully, there is a consistent naming convention, so I can use the tidyr::separate_wider_regex() function to identify the upload date and title for each page.

Code
# Scrape Site Map and Clean

raw_xml <- xml2::read_xml("https://tim.blog/post-sitemap2.xml")

site_df <- raw_xml |> 
  xml2::xml_ns_strip() |> 
  xml2::xml_find_all(".//url") |> 
  xml2::xml_find_all(".//loc") |> 
  xml2::xml_text() |> 
  tibble::as_tibble_col(column_name = "urls") |> 
  tidyr::separate_wider_regex(
    urls,
    patterns = c(
      "https://tim.blog/",
      year = "[:digit:]{4}",
      "/",
      month = "[:digit:]{2}",
      "/",
      day = "[:digit:]{2}",
      "/",
      article = ".*",
      "/"
    ),
    cols_remove = FALSE
  ) |> 
  dplyr::mutate(
    upload_date = lubridate::ymd(paste0(year, month, day)),
    .keep = "unused"
  )

After a quick review of the URLs, several patterns start to stand out. First, Tim posts transcripts of each podcast episode on his site. He also posts several different flavors of recap episodes, along with content from other projects he has created, such as Tools of Titans.
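
A quick tally over the article slugs shows just how common a few of these patterns are. Here is a minimal sketch built on the site_df from above, using only keywords already called out in this review:

Code
# rough count of the recurring non-episode patterns in the article slugs
site_df |> 
  dplyr::summarise(
    transcripts     = sum(stringr::str_detect(article, "transcript")),
    recaps          = sum(stringr::str_detect(article, "recap")),
    tools_of_titans = sum(stringr::str_detect(article, "tools-of-titans"))
  )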

If I make a list of keywords from the patterns identified above, combined with a filter on the upload date to strip out any URL that occurred before the first podcast episode, I should be able to pare down my dataframe to just the podcast episode web pages.

Being the Tim Ferriss Show connoisseur that I am, I also know that he took a sabbatical in the middle of 2024. To fill the content gap, he published “new” episodes that combined two past podcast episodes. Since I only want the show links for each original podcast, I will need to filter out this chunk of time as well.

Code
# disregard non-pertinent urls after manual review of site_df
black_list <- c("transcript", "transcipt", "in-case-you-missed",
                "recap", "tools-of-titans", "cockpunch", "top-",
                "insights-from-")

podcast_df <- site_df |> 
  # filtering to on or after the first podcast episode
  dplyr::filter(upload_date >= as.Date("2014-04-22")) |>
  # removing a stretch of time where old podcasts were combined to make a new podcast
  dplyr::filter(upload_date > as.Date("2024-08-29") |
                  upload_date < as.Date("2024-05-16")) |>
  dplyr::filter(!stringr::str_detect(article, paste(black_list, collapse = "|"))) |> 
  # removing one-off recap that would cause duplicate show links
  dplyr::filter(article != "the-30-most-popular-episodes-of-the-tim-ferriss-show-from-2022")

And with that, I have a dataframe of each Tim Ferriss Show podcast episode and its upload date! Now, it’s time to get to scraping.

Scraping The Episodes

Since I am a fan of Tim’s, and certainly not trying to get in trouble with him (if you’re reading this, Tim, hello!), I want to be respectful while I’m scraping. Enter the polite package. By using polite::bow(), I can engage with the host once, gain an understanding of the robots.txt file that is in place, and obey its scraping limitations while gathering the data I’m looking for.

By setting up a little function, I can polite::nod() to each podcast URL, continuing my single point of contact with the host while scraping under the prescribed parameters. Using the rvest package, I can gather both the text and the href attribute for each show note link. Bundling this function with purrr::map(), I can iterate over each URL and build out the final show links dataframe.

Code
session <- polite::bow("https://tim.blog/")

get_show_links <- function(url) {
  tryCatch(
    {
      # create throwaway list for each list item on a podcast web page
      foo <- session |> 
        polite::nod(path = url) |>
        polite::scrape() |> 
        rvest::html_elements(".wp-block-list li a")
      
      # build dataframe from throwaway list to capture link title and link URL
      bar <- data.frame(
        link_title = foo |> rvest::html_text(),
        link_url = foo |> rvest::html_attr("href")
      )
      
      return(bar)
    }, 
    
    error = function(msg) {
      message(paste("The article", url, "encountered an issue when scraping show links."))
      return(NA)
    }
  )
}

# unnest the show_links column, which returns a dataframe for each podcast URL, to tidy the data
show_links_df <- podcast_df |> 
  dplyr::mutate(show_links = purrr::map(urls, get_show_links)) |> 
  tidyr::unnest_longer(show_links) |> 
  tidyr::unnest_wider(show_links)
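
One simple way to hand these results off to a daily job is to save them to disk. A minimal sketch, where the folder and file name are placeholders:

Code
# save the tidied show links so the daily job has something to draw from
# (the data/ path and file name are placeholders -- adjust to taste)
saveRDS(show_links_df, here::here("data", "show_links.rds"))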

Wrap Up

With a little help from Tim’s sitemap, I was able to locate and clean show notes from over 700 podcast episodes. Combined with just a bit of setup to create a Telegram bot, along with a short GitHub Actions script, I now get a new show note link each morning. And if you’re feeling left out, don’t worry. You can join the Tao of Tim Telegram channel too!
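
For the curious, the sending side of that morning job boils down to a few lines with the telegram.bot package loaded earlier. This is just a rough sketch: the environment variable names and the file path are placeholders you would set yourself, not the real ones.

Code
# sketch of the daily job: pick a random show link and send it via Telegram
# (TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, and the file path are placeholders)
show_links_df <- readRDS(here::here("data", "show_links.rds"))

daily_link <- show_links_df |> 
  dplyr::slice_sample(n = 1)

bot <- telegram.bot::Bot(token = Sys.getenv("TELEGRAM_BOT_TOKEN"))

bot$sendMessage(
  chat_id = Sys.getenv("TELEGRAM_CHAT_ID"),
  text = paste0(daily_link$link_title, "\n", daily_link$link_url)
)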