Creating the Tao of Tim

Outsourcing exposure to new ideas through web scraping and automation

Overview

The Tim Ferriss Show is a podcast produced by Tim Ferriss, which aims to deconstruct world-class performers and share their tools and tactics along the way.

The goal of this project is to programmatically extract the show links from each episode and create a job that runs each morning to serve up a random link. By doing this, I hope to automate my exposure to different ideas from top performers.

Finding The Data

library(dplyr)
library(httr2)
library(here)
library(stringr)
library(tidyr)
library(xml2)
library(telegram.bot)

Show links are available for each podcast guest here. Links to the individual podcast episodes are located within the tim-ferriss-podcast-list div tag.

The only problem is that only a handful of episodes are visible at a time. To expose more episodes, I need to click the load-more-podcasts button, which runs some JavaScript against a WordPress API to load additional podcasts from the tim_podcasts endpoint.

To work around this, I figured I would use some Python and the selenium library to programmatically click the load-more-podcasts button until it was disabled.

The selenium approach worked - to a degree - but there is JavaScript within Tim’s website that renders multiple call-to-action pages. I could spend more time finding a way around the JavaScript, but as I was thinking about how to do that, I stumbled across the sitemap.

Now this is what I’m looking for!

No need to create a headless browser and deal with all this javascript, I can instead get right to the links I’m after.

Gathering The URLs

Since the sitemap is in XML format, I first need to use the xml2 package to parse out the URL for each web page. Thankfully, there is a consistent naming convention, so I can use the tidyr::separate_wider_regex() function to identify the upload date and title for each page.

Code
# Scrape Site Map and Clean

raw_xml <- xml2::read_xml("https://tim.blog/post-sitemap2.xml")

site_df <- raw_xml |> 
  xml2::xml_ns_strip() |> 
  xml2::xml_find_all(".//url") |> 
  xml2::xml_find_all(".//loc") |> 
  xml2::xml_text() |> 
  tibble::as_tibble_col(column_name = "urls") |> 
  tidyr::separate_wider_regex(
    urls,
    patterns = c(
      "https://tim.blog/",
      year = "[:digit:]{4}",
      "/",
      month = "[:digit:]{2}",
      "/",
      day = "[:digit:]{2}",
      "/",
      article = ".*",
      "/"
    ),
    cols_remove = FALSE
  ) |> 
  dplyr::mutate(
    upload_date = lubridate::ymd(paste0(year, month, day)),
    .keep = "unused"
  )

After a quick review of the URLs, several patterns start to stand out. First, Tim posts transcripts of each podcast episode on his site. He also posts several different flavors of recap episodes, along with content from other projects he has created, such as Tools of Titans.
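
A quick tally over the article slugs shows just how common a few of these patterns are. Here is a minimal sketch built on the site_df from above, using only keywords already called out in this review:

Code
# rough count of the recurring non-episode patterns in the article slugs
site_df |> 
  dplyr::summarise(
    transcripts     = sum(stringr::str_detect(article, "transcript")),
    recaps          = sum(stringr::str_detect(article, "recap")),
    tools_of_titans = sum(stringr::str_detect(article, "tools-of-titans"))
  )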

If I make a list of keywords from the patterns identified above, combined with a filter on the upload date to strip out any URL that occurred before the first podcast episode, I should be able to pare down my dataframe to just the podcast episode web pages.

Being the Tim Ferriss Show connoisseur that I am, I also know that he took a sabbatical in the middle of 2024. To fill the content gap, he published “new” episodes that combined two past podcast episodes. Since I only want the show links for each original podcast, I will need to filter out this chunk of time as well.

Code
# disregard non-pertinent urls after manual review of site_df
black_list <- c("transcript", "transcipt", "in-case-you-missed",
                "recap", "tools-of-titans", "cockpunch", "top-",
                "insights-from-")

podcast_df <- site_df |> 
  # filtering to on or after the first podcast episode
  dplyr::filter(upload_date >= as.Date("2014-04-22")) |>
  # removing a stretch of time where old podcasts were combined to make a new podcast
  dplyr::filter(upload_date > as.Date("2024-08-29") |
                  upload_date < as.Date("2024-05-16")) |>
  dplyr::filter(!stringr::str_detect(article, paste(black_list, collapse = "|"))) |> 
  # removing one-off recap that would cause duplicate show links
  dplyr::filter(article != "the-30-most-popular-episodes-of-the-tim-ferriss-show-from-2022")

And with that, I have a dataframe of each Tim Ferriss Show podcast episode and its upload date! Now, it’s time to get to scraping.

Scraping The Episodes

Since I am a fan of Tim’s, and certainly not trying to get in trouble with him (if you’re reading this, Tim, hello!), I want to be respectful while I’m scraping. Enter the polite package. By using polite::bow(), I can engage with the host once, gain an understanding of the robots.txt file that is in place, and obey its scraping limitations while gathering the data I’m looking for.

By setting up a little function, I can polite::nod() to each podcast URL, continuing my single point of contact with the host while scraping under the prescribed parameters. Using the rvest package, I can gather both the text and the href attribute for each show note link. Bundling this function with purrr::map(), I can iterate over each URL and build out the final show links dataframe.

Code
session <- polite::bow("https://tim.blog/")

get_show_links <- function(url) {
  tryCatch(
    {
      # create throwaway list for each list item on a podcast web page
      foo <- session |> 
        polite::nod(path = url) |>
        polite::scrape() |> 
        rvest::html_elements(".wp-block-list li a")
      
      # build dataframe from throwaway list to capture link title and link URL
      bar <- data.frame(
        link_title = foo |> rvest::html_text(),
        link_url = foo |> rvest::html_attr("href")
      )
      
      return(bar)
    }, 
    
    error = function(msg) {
      message(paste("The article", url, "encountered an issue when scraping show links."))
      return(NA)
    }
  )
}

# unnest the show_links column, which returns a dataframe for each podcast URL, to tidy the data
show_links_df <- podcast_df |> 
  dplyr::mutate(show_links = purrr::map(urls, get_show_links)) |> 
  tidyr::unnest_longer(show_links) |> 
  tidyr::unnest_wider(show_links)
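
One simple way to hand these results off to a daily job is to save them to disk. A minimal sketch, where the folder and file name are placeholders:

Code
# save the tidied show links so the daily job has something to draw from
# (the data/ path and file name are placeholders -- adjust to taste)
saveRDS(show_links_df, here::here("data", "show_links.rds"))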

Wrap Up

With a little help from Tim’s sitemap, I was able to locate and clean show notes from over 700 podcast episodes. Combined with just a bit of setup to create a Telegram bot, along with a short GitHub Actions script, I now get a new show note link each morning. And if you’re feeling left out, don’t worry. You can join the Tao of Tim Telegram channel too!
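
For the curious, the sending side of that morning job boils down to a few lines with the telegram.bot package loaded earlier. This is just a rough sketch: the environment variable names and the file path are placeholders you would set yourself, not the real ones.

Code
# sketch of the daily job: pick a random show link and send it via Telegram
# (TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, and the file path are placeholders)
show_links_df <- readRDS(here::here("data", "show_links.rds"))

daily_link <- show_links_df |> 
  dplyr::slice_sample(n = 1)

bot <- telegram.bot::Bot(token = Sys.getenv("TELEGRAM_BOT_TOKEN"))

bot$sendMessage(
  chat_id = Sys.getenv("TELEGRAM_CHAT_ID"),
  text = paste0(daily_link$link_title, "\n", daily_link$link_url)
)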