library(dplyr)
library(httr2)
library(here)
library(stringr)
library(tidyr)
library(xml2)
library(telegram.bot)
Creating the Tao of Tim
Outsourcing exposure to new ideas through web scraping and automation
Overview
The Tim Ferriss Show is a podcast produced by Tim Ferriss, which aims to deconstruct world-class performers and share their tools and tactics along the way.
The goal of this project is to programmatically extract the show links from each episode and create a job that runs each morning to serve up a random link. By doing this, I hope to automate the exposure to different ideas from top-performers.
Finding The Data
Show links are available for each podcast guest here. Links to the individual podcast episodes are located within the tim-ferriss-podcast-list div tag.
The only problem is that only a handful of episodes are visible at a time. To expose more episodes, I need to click on the load-more-podcasts button, which runs some JavaScript to access a WordPress API and load additional podcasts from the tim_podcasts.endpoint.
To work around this, I figured I would use some Python and the selenium library to programmatically click the load-more-podcasts button until it was disabled.
The selenium approach worked, to a degree, but there is JavaScript within Tim’s website that renders multiple call-to-action pages. I could spend more time finding a way around the JavaScript, but as I was thinking about how to do that, I stumbled across the sitemap.
Now this is what I’m looking for!
No need to create a headless browser and deal with all this JavaScript; I can instead get right to the links I’m after.
Gathering The URLs
Since the site map is in xml format, I first need to use the xml2 package to parse the URL for each web page. Thankfully, there is a consistent naming convention, so I can use the tidyr::separate_wider_regex() function to identify the upload date and title for each page.
Code
# Scrape Site Map and Clean
raw_xml <- xml2::read_xml("https://tim.blog/post-sitemap2.xml")

site_df <- raw_xml |>
  xml2::xml_ns_strip() |>
  xml2::xml_find_all(".//url") |>
  xml2::xml_find_all(".//loc") |>
  xml2::xml_text() |>
  tibble::as_tibble_col(column_name = "urls") |>
  tidyr::separate_wider_regex(
    urls,
    patterns = c(
      "https://tim.blog/",
      year = "[:digit:]{4}",
      "/",
      month = "[:digit:]{2}",
      "/",
      day = "[:digit:]{2}",
      "/",
      article = ".*",
      "/"
    ),
    cols_remove = FALSE
  ) |>
  dplyr::mutate(
    upload_date = lubridate::ymd(paste0(year, month, day)),
    .keep = "unused"
  )
After a quick review of the URLs, several patterns start to stand out. First, Tim posts transcripts of each podcast episode on his site. He also posts several different flavors of recap episodes, along with content from other projects he has created, such as Tools of Titans.
If I make a list of keywords from the patterns identified above, combined with a filter on the upload date to strip out any URL that occurred before the first podcast episode, I should be able to pare down my dataframe to just the podcast episode web pages.
Being the Tim Ferriss Show connoisseur that I am, I also know that he took a sabbatical in the middle of 2024. To fill the content gap, he published “new” episodes that combined two past podcast episodes. Since I only want the show links for each original podcast, I will need to filter out this chunk of time as well.
Code
# disregard non-pertinent urls after manual review of site_df
black_list <- c("transcript", "transcipt", "in-case-you-missed",
                "recap", "tools-of-titans", "cockpunch", "top-",
                "insights-from-")

podcast_df <- site_df |>
  # filtering to on or after the first podcast episode
  dplyr::filter(upload_date >= as.Date("2014-04-22")) |>
  # removing a stretch of time where old podcasts were combined to make a new podcast
  dplyr::filter(upload_date > as.Date("2024-08-29") |
                  upload_date < as.Date("2024-05-16")) |>
  dplyr::filter(stringr::str_detect(article, paste(black_list, collapse = "|")) == FALSE) |>
  # removing one-off recap that would cause duplicate show links
  dplyr::filter(article != "the-30-most-popular-episodes-of-the-tim-ferriss-show-from-2022")
And with that, I have a dataframe of each Tim Ferriss Show podcast episode and its upload date! Now, it’s time to get to scraping.
Scraping The Episodes
Since I am a fan of Tim’s, and certainly not trying to get in trouble with him (if you’re reading this Tim, hello!), I want to be respectful while I’m scraping. Enter the polite package. By using polite::bow(), I can engage with the host once, gain an understanding of the robots.txt file that is in place, and obey the scraping limitations while gathering the data I’m looking for.
By setting up a little function, I can polite::nod() to each podcast URL, keeping a single point of contact with the host while scraping under the prescribed parameters. Using the rvest package, I can gather both the text and the href attribute for each show note link. Bundling this function with purrr::map(), I can iterate over each URL and build the final show links dataframe.
Code
session <- polite::bow("https://tim.blog/")

get_show_links <- function(url) {
  tryCatch(
    {
      # create throwaway list for each list item on a podcast web page
      foo <- session |>
        polite::nod(path = url) |>
        polite::scrape() |>
        rvest::html_elements(".wp-block-list li a")

      # build dataframe from throwaway list to capture link title and link URL
      bar <- data.frame(
        link_title = foo |> rvest::html_text(),
        link_url = foo |> rvest::html_attr("href")
      )
      return(bar)
    },
    error = function(msg) {
      message(paste("The article", url, "encountered an issue when scraping show links."))
      return(NA)
    }
  )
}

# need to unnest the show_links column which returns a dataframe for each podcast URL to tidy the data
show_links_df <- podcast_df |>
  dplyr::mutate(show_links = purrr::map(urls, get_show_links)) |>
  tidyr::unnest_longer(show_links) |>
  tidyr::unnest_wider(show_links)
Sending The Show Links
With my dataframe of show note links in hand, I just need to find a way to send myself a random link each morning. The random link piece is straightforward: I can use the dplyr::slice_sample() function to pull out a new link each day. The part that is more involved is setting up a way to get the link to me.
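As a quick illustration, the selection step could look something like this minimal sketch, assuming the show_links_df dataframe built above (the random_link name is just for illustration):

# pull one random show note link from the scraped dataframe
random_link <- show_links_df |>
  dplyr::slice_sample(n = 1)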
Enter GitHub Actions and Telegram. By using GitHub Actions, I can automatically run a script to randomly select a show link. Using Telegram, and the telegram.bot R package, I can send the show link as a Telegram message to either myself or a Telegram channel.
Quite a bit has been written about setting up a Telegram bot. Instead of adding more, I will point you towards a helpful blog post written by my friend Brad Lindblad. In the post, he automates the scraping of weekly meat specials using Python and Telegram.
After following along with Brad’s blog post, I now have a Telegram bot. I’ve tested the telegram.bot::Bot() setup and have sent a few test messages to myself using bot$sendMessage(). I think I’m ready to combine the link selection and message sending into a script.
Looking at the bot$sendMessage() documentation, I see there is a parse_mode argument. Since I have show links, and I want to provide the URL to the link that was selected, I need to use the markdown option to properly format my message. With a little magic from paste0(), I create the variable telegram_message, which is assigned a string that matches the markdown syntax I need to send out the message.
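The full script isn’t reproduced here, but a rough sketch of the idea, assuming the random_link selection sketched above and the environment variable names used in the workflow below, might look like this:

# rough sketch: format the selected link as Markdown and send it through the bot
bot <- telegram.bot::Bot(token = Sys.getenv("R_TELEGRAM_BOT_TIM_TAO_BOT"))

telegram_message <- paste0(
  "[", random_link$link_title, "](", random_link$link_url, ")"
)

bot$sendMessage(
  chat_id = Sys.getenv("TELEGRAM_CHANNEL_ID"),
  text = telegram_message,
  parse_mode = "Markdown"
)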
Where this work differs from Brad’s post is the GitHub Action used to run the random link script. Since I am working in R and using the renv package to manage my project environment, I need to include a couple of additional steps to get things working. The R community has a convenient repository of GitHub Actions available, so I can just plug in both the setup-r@v2 and setup-renv@v2 actions and I am off and running. You can see the full workflow below.
name: Tao te Tim Telegram

on:
  workflow_dispatch:
  schedule:
    - cron: '30 6 * * *'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Install R
        uses: r-lib/actions/setup-r@v2
        with:
          r-version: '4.4.3'

      - name: Install libcurl
        if: runner.os == 'Linux'
        run: sudo apt-get update -y && sudo apt-get install -y libcurl4-openssl-dev

      - name: Set up renv
        uses: r-lib/actions/setup-renv@v2

      - name: Run send_telegram
        run: Rscript -e 'source("R/send_telegram.R")'
        env:
          R_TELEGRAM_BOT_TIM_TAO_BOT: ${{ secrets.R_TELEGRAM_BOT_TIM_TAO_BOT }}
          TELEGRAM_CHANNEL_ID: ${{ secrets.TELEGRAM_CHANNEL_ID }}
One word of caution as it relates to the secrets: I assumed a secret whose value was a string needed to be entered into GitHub the same way, wrapped in quotes. This is not the case. If you include the quotes, you will receive an error when running the action. Hopefully this will save you some time if you find yourself in a similar situation.
Wrap Up
With a little help from Tim’s site map, I was able to locate and clean show note links from over 700 podcast episodes. Combined with just a bit of setup to create a Telegram bot, along with a short GitHub Actions workflow, I now get a new show note link each morning. And if you’re feeling left out, don’t worry. You can join the Tao of Tim Telegram channel too!