yt-dlp

Purpose

CLI utility for pulling YouTube (and other platform) metadata and auto-generated captions without downloading video files. Primary use: extracting transcripts from YouTube videos and podcast episodes for the content-fuel library at /Claude/knowledge/intel/content-fuel/.

How we use it

  • Transcript extraction for content fuel. When Rick or anyone on the team sends a YouTube link as a "good piece, worth reading," the CMO pulls the auto-captions, cleans them, and files the piece in /Claude/knowledge/intel/content-fuel/ with a frontmatter block and a "Takeaways for eco|monetize" summary at the top.
  • Podcast episode transcription when the episode is hosted on YouTube (many GTM/VC podcasts dual-publish).
  • Metadata capture — title, duration, upload date, description, view count — via --write-info-json.

Install

brew install yt-dlp

Installed on Rick's Mac 2026-04-17. No auth or credentials required for public videos.

Canonical commands

Pull auto-captions + metadata (no video download):

yt-dlp --write-auto-sub --write-sub --sub-lang en --skip-download \
  --sub-format vtt --write-info-json \
  -o "youtube-%(id)s-%(title)s.%(ext)s" \
  "https://youtu.be/<VIDEO_ID>"

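Produces two files:

  • *.en.vtt: English captions in VTT format. When both --write-sub and --write-auto-sub are set, yt-dlp writes the uploader's manual track if one exists (often absent) and otherwise falls back to the auto-generated captions.
  • *.info.json: metadata (title, duration, description, channel, etc.)

Because --sub-format vtt is set, no .srt file is written, even when manual subtitles exist.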
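When the pull needs to be scripted rather than typed by hand, the same flags can be assembled in Python. This is a minimal sketch, assuming yt-dlp is on PATH; the function names build_caption_command and pull_captions are illustrative and not part of yt-dlp.

```python
import subprocess


def build_caption_command(video_url: str, lang: str = "en") -> list[str]:
    """Return the yt-dlp argv for a captions-and-metadata-only pull."""
    return [
        "yt-dlp",
        "--write-auto-sub", "--write-sub",
        "--sub-lang", lang,
        "--skip-download",            # never fetch the video itself
        "--sub-format", "vtt",
        "--write-info-json",
        "-o", "youtube-%(id)s-%(title)s.%(ext)s",
        video_url,
    ]


def pull_captions(video_url: str, workdir: str = ".") -> None:
    """Run yt-dlp inside workdir; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_caption_command(video_url), cwd=workdir, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with the output template.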

Check available subtitle tracks before downloading:

yt-dlp --list-subs "https://youtu.be/<VIDEO_ID>"

Cleaning VTT → markdown

Auto-caption VTT files are noisy: rolling captions duplicate text, timestamps are embedded inline, and positional tags litter the content. Python snippet used in the Auren Hoffman transcript capture (2026-04-17):

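import re

# Placeholder: path to the *.en.vtt file written by yt-dlp
vtt_path = "youtube-<VIDEO_ID>-<title>.en.vtt"

with open(vtt_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

seen = set()
out = []
for ln in lines:
    if not ln.strip():
        continue
    # Skip the file header and cue timing lines
    if ln.startswith(("WEBVTT", "Kind:", "Language:")):
        continue
    if "-->" in ln:
        continue
    # Strip inline timestamps and positional tags such as <00:00:01.000> and <c>
    ln = re.sub(r"<[^>]+>", "", ln).strip()
    # Rolling captions repeat each line across cues; keep the first occurrence only.
    # Note: this also drops any line that legitimately recurs later in the video.
    if not ln or ln in seen:
        continue
    seen.add(ln)
    out.append(ln)

text = re.sub(r"\s+", " ", " ".join(out)).strip()
# Then paragraph-break every ~800 chars at sentence boundaries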
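The paragraph-break step mentioned in the last comment is not shown above. One way it might look is sketched below; paragraph_break and its target parameter are hypothetical names, and the sentence split is a simple punctuation heuristic, not the exact logic used in the original capture.

```python
import re


def paragraph_break(text: str, target: int = 800) -> str:
    """Group sentences into paragraphs of roughly `target` characters."""
    # Split after ., ! or ? followed by whitespace. Auto-captions often lack
    # punctuation, in which case the whole transcript stays one paragraph.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    paragraphs, current, length = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        current.append(sentence)
        length += len(sentence) + 1  # +1 for the joining space
        if length >= target:
            # Close the paragraph at the sentence that crossed the target
            paragraphs.append(" ".join(current))
            current, length = [], 0
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Applied to the cleaned transcript, this would be text = paragraph_break(text). Paragraphs end on the first sentence boundary at or past the target, so they run slightly over 800 characters and never split mid-sentence.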

Workflow for content-fuel captures

  1. Pull VTT + info.json via yt-dlp (commands above)
  2. Clean VTT to plain text, paragraph-break for readability
  3. Load info.json for title, channel, duration, description
  4. Create /Claude/knowledge/intel/content-fuel/YYYY-MM-DD-{source}-{slug}.md with:
       • Frontmatter per /Claude/knowledge/intel/content-fuel/README.md
       • "Takeaways for eco|monetize" section mined by CMO
       • "Suggested content pipeline" table routing to Substack/LinkedIn/blog
       • Raw transcript appended at the bottom
  5. Keep .raw.vtt and .info.json as sibling artifacts for auditability
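The metadata-loading and file-naming steps can be sketched as follows. The helper names (slugify, load_metadata, capture_path) are illustrative, and using the capture date for the YYYY-MM-DD prefix is an assumption; the frontmatter itself is defined in the README and is not reproduced here. The keys read from info.json are standard yt-dlp fields.

```python
import json
import re
from datetime import date
from pathlib import Path

CONTENT_FUEL_DIR = Path("/Claude/knowledge/intel/content-fuel")


def slugify(value: str, max_len: int = 60) -> str:
    """Lowercase, collapse runs of non-alphanumerics to hyphens, trim to max_len."""
    slug = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    return slug[:max_len].rstrip("-")


def load_metadata(info_path: str) -> dict:
    """Pick the fields the workflow needs out of a yt-dlp info.json."""
    with open(info_path, encoding="utf-8") as f:
        info = json.load(f)
    keys = ("title", "channel", "duration", "description", "upload_date", "webpage_url")
    # Missing keys come back as None rather than raising
    return {k: info.get(k) for k in keys}


def capture_path(source: str, slug_text: str, captured: date) -> Path:
    """Build YYYY-MM-DD-{source}-{slug}.md inside the content-fuel library."""
    name = f"{captured.isoformat()}-{slugify(source)}-{slugify(slug_text)}.md"
    return CONTENT_FUEL_DIR / name
```

For example, capture_path("GTMnow", "Auren Hoffman: Software Spend", date(2026, 4, 17)) yields a file named 2026-04-17-gtmnow-auren-hoffman-software-spend.md.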

Known limitations

  • Auto-captions only unless the uploader has published manual subtitles. Auto-captions miss speaker attribution and garble technical terms and names.
  • ffmpeg missing warning — yt-dlp will warn but still succeed for captions-only extraction. Install ffmpeg (brew install ffmpeg) only if we move into audio/video download use cases.
  • No speaker diarization. For podcast transcripts where speaker attribution matters, cross-reference the cleaned transcript against the original video or use a diarization service downstream.
  • Impersonation warning — benign for public-video caption pulls, but future platform changes may require installing impersonation targets. Monitor if caption pulls start failing.
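Whether a given capture came from manual subtitles or auto-captions can be read back from the saved metadata. This is a minimal sketch, assuming the info.json written by --write-info-json includes yt-dlp's subtitles and automatic_captions dictionaries keyed by language code; caption_source is a hypothetical helper.

```python
def caption_source(info: dict, lang: str = "en") -> str:
    """Return "manual", "auto", or "none" for the requested language code."""
    # Manual tracks are checked first, mirroring yt-dlp's preference for them
    if lang in (info.get("subtitles") or {}):
        return "manual"
    if lang in (info.get("automatic_captions") or {}):
        return "auto"
    return "none"
```

The match is exact, so a manual track tagged with a regional code such as en-US would not count as "en" here. Recording the result in the capture's frontmatter would flag transcripts that need extra care around names and technical terms.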

When NOT to use yt-dlp

  • Private meeting recordings — use the native meeting platform transcription (Sembly, Granola, Google Meet) instead. yt-dlp is for public, published media only.
  • Audio-only files (.m4a, .wav, .mp3) stored locally — route to mining.mind for local audio transcription via Whisper or equivalent.
  • Platforms outside YouTube — yt-dlp supports many platforms (Vimeo, SoundCloud, etc.) but each has quirks. Check yt-dlp --list-extractors and test before promising support.

Related

  • Content-fuel library: /Claude/knowledge/intel/content-fuel/
  • Example capture: /Claude/knowledge/intel/content-fuel/2026-04-17-gtmnow-auren-hoffman-software-spend.md

Owner: cmo · Seeded: 2026-04-17