yt-dlp

Purpose

CLI utility for pulling YouTube (and other platform) metadata and auto-generated captions without downloading video files. Primary use: extracting transcripts from YouTube videos and podcast episodes for the content-fuel library at /Claude/knowledge/intel/content-fuel/.

How we use it

Transcript extraction for content fuel. When Rick or anyone on the team sends a YouTube link as a "good piece, worth reading," the CMO pulls the auto-captions, cleans them, and files the piece in /Claude/knowledge/intel/content-fuel/ with a frontmatter block and a "Takeaways for eco|monetize" summary at the top.
Podcast episode transcription when the episode is hosted on YouTube (many GTM/VC podcasts dual-publish).
Metadata capture — title, duration, upload date, description, view count — via --write-info-json.

Install

brew install yt-dlp

Installed on Rick's Mac 2026-04-17. No auth or credentials required for public videos.

Canonical commands

Pull auto-captions + metadata (no video download):

yt-dlp --write-auto-sub --write-sub --sub-lang en --skip-download \
  --sub-format vtt --write-info-json \
  -o "youtube-%(id)s-%(title)s.%(ext)s" \
  "https://youtu.be/<VIDEO_ID>"

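Produces two files:

- *.en.vtt: English captions in VTT format. When the uploader has published manual English subtitles (often absent), yt-dlp writes those in place of the auto-generated track; because --sub-format vtt is set, they also arrive as VTT rather than SRT.
- *.info.json: metadata (title, duration, description, channel, etc.)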
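When the pull needs to run from a script rather than a shell, the command above can be wrapped in Python. This is a minimal sketch that assumes yt-dlp is on PATH; the names `build_caption_command` and `pull_captions` are illustrative helpers, not part of yt-dlp.

```python
import subprocess
from pathlib import Path

# Same output template as the shell command above.
OUTPUT_TEMPLATE = "youtube-%(id)s-%(title)s.%(ext)s"


def build_caption_command(url: str, out_dir: str = ".") -> list:
    """Return the argv list for a captions + metadata pull (no video)."""
    return [
        "yt-dlp",
        "--write-auto-sub", "--write-sub",
        "--sub-lang", "en",
        "--skip-download",
        "--sub-format", "vtt",
        "--write-info-json",
        "-o", str(Path(out_dir) / OUTPUT_TEMPLATE),
        url,
    ]


def pull_captions(url: str, out_dir: str = ".") -> None:
    """Run yt-dlp; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_caption_command(url, out_dir), check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with the `%(...)s` template and with URLs that contain `&`.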

Check available subtitle tracks before downloading:

yt-dlp --list-subs "https://youtu.be/<VIDEO_ID>"
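Once a pull has run, the same information is available in the info.json, which records uploader-provided tracks under `subtitles` and auto-generated ones under `automatic_captions`. A minimal sketch for labelling a capture as manual or auto; `caption_source` is a hypothetical helper, and the sample dict is made up for illustration.

```python
from typing import Optional


def caption_source(info: dict, lang: str = "en") -> Optional[str]:
    """Return "manual", "auto", or None for the given language code."""
    # Uploader-provided subtitles take priority over auto-captions.
    if (info.get("subtitles") or {}).get(lang):
        return "manual"
    if (info.get("automatic_captions") or {}).get(lang):
        return "auto"
    return None


# Made-up sample in the shape of an info.json; real files carry many more keys.
sample = {"subtitles": {}, "automatic_captions": {"en": [{"ext": "vtt"}]}}
print(caption_source(sample))
```

Recording this label in the capture's frontmatter would flag transcripts that need extra proofreading, since auto-captions garble names and technical terms.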

Cleaning VTT → markdown

Auto-caption VTT files are noisy: rolling captions repeat the same text across cues, timestamps are embedded inline, and positional tags litter the content. The short Python pattern below was used in the Auren Hoffman transcript capture (2026-04-17):

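import re

# Placeholder: path to the .en.vtt file written by yt-dlp
vtt_path = "youtube-<VIDEO_ID>-<title>.en.vtt"

with open(vtt_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

seen = set()
out = []
for ln in lines:
    if not ln.strip():
        continue
    # Skip the file header lines
    if ln.startswith(("WEBVTT", "Kind:", "Language:")):
        continue
    # Skip cue timing lines (timestamps plus positional settings)
    if "-->" in ln:
        continue
    # Strip inline tags such as <c> and word-level timestamps
    ln = re.sub(r"<[^>]+>", "", ln).strip()
    # Drop lines already emitted; rolling captions repeat each line across cues.
    # Caveat: this also drops a line that legitimately recurs later in the video.
    if not ln or ln in seen:
        continue
    seen.add(ln)
    out.append(ln)

# Collapse everything into one whitespace-normalized string
text = re.sub(r"\s+", " ", " ".join(out)).strip()
# Then paragraph-break every ~800 chars at sentence boundaries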
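The paragraph-break step mentioned in the final comment is not shown in the snippet. One way to do it, as a rough sketch: split on sentence-ending punctuation and start a new paragraph once roughly 800 characters have accumulated. The function name `paragraph_break` and the `target` parameter are assumptions for illustration, not the code used in the original capture.

```python
import re


def paragraph_break(text: str, target: int = 800) -> str:
    """Group sentences into paragraphs of roughly `target` characters."""
    # Split after ., ! or ? followed by whitespace. This is a heuristic and
    # will also split after abbreviations such as "Dr. ".
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    paragraphs, current, length = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        current.append(sentence)
        length += len(sentence) + 1  # +1 for the joining space
        if length >= target:
            paragraphs.append(" ".join(current))
            current, length = [], 0
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Paragraphs always end on a sentence boundary, so each one may run somewhat past the target. Auto-captions sometimes carry little or no punctuation; in that case this returns a single long paragraph and a word-count fallback would be needed.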

Workflow for content-fuel captures

1. Pull VTT + info.json via yt-dlp (commands above)
2. Clean VTT to plain text, paragraph-break for readability
3. Load info.json for title, channel, duration, description
4. Create /Claude/knowledge/intel/content-fuel/YYYY-MM-DD-{source}-{slug}.md with:
   - Frontmatter per /Claude/knowledge/intel/content-fuel/README.md
   - "Takeaways for eco|monetize" section mined by CMO
   - "Suggested content pipeline" table routing to Substack/LinkedIn/blog
   - Raw transcript appended at the bottom
5. Keep .raw.vtt and .info.json as sibling artifacts for auditability
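The metadata-loading and file-naming steps above can be sketched as follows. The keys read from info.json (`title`, `channel`, `duration`, `upload_date`, `description`) are fields yt-dlp writes for YouTube videos; `slugify`, `load_metadata`, and `capture_path` are hypothetical helpers. Two assumptions are made here: the date prefix is the capture date rather than the upload date, and the slug is chosen by hand. The frontmatter is left out because its schema lives in the README.

```python
import json
import re
from datetime import date
from pathlib import Path

LIBRARY = Path("/Claude/knowledge/intel/content-fuel")


def slugify(text: str) -> str:
    """Lowercase, then replace runs of non-alphanumerics with single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def load_metadata(info_path: str) -> dict:
    """Read the handful of info.json fields the capture needs."""
    with open(info_path, encoding="utf-8") as f:
        info = json.load(f)
    keys = ("title", "channel", "duration", "upload_date", "description")
    # Missing keys come back as None rather than raising.
    return {k: info.get(k) for k in keys}


def capture_path(source: str, slug: str, captured: date) -> Path:
    """Build YYYY-MM-DD-{source}-{slug}.md inside the library.

    Assumption: `captured` is the capture date, not the upload date.
    """
    name = f"{captured.isoformat()}-{slugify(source)}-{slugify(slug)}.md"
    return LIBRARY / name


print(capture_path("GTMnow", "Auren Hoffman: software spend", date(2026, 4, 17)))
```

With these inputs the printed path matches the example capture listed at the end of this page.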

Known limitations

Auto-captions only unless the uploader has published manual subtitles. Auto-captions miss speaker attribution and garble technical terms and names.
ffmpeg missing warning — yt-dlp will warn but still succeed for captions-only extraction. Install ffmpeg (brew install ffmpeg) only if we move into audio/video download use cases.
No speaker diarization. For podcast transcripts where speaker attribution matters, cross-reference the cleaned transcript against the original video or use a diarization service downstream.
Impersonation warning — benign for public-video caption pulls, but future platform changes may require installing impersonation targets. Monitor if caption pulls start failing.

When NOT to use yt-dlp

Private meeting recordings — use the native meeting platform transcription (Sembly, Granola, Google Meet) instead. yt-dlp is for public, published media only.
Audio-only files (.m4a, .wav, .mp3) stored locally — route to mining.mind for local audio transcription via Whisper or equivalent.
Platforms outside YouTube — yt-dlp supports many platforms (Vimeo, SoundCloud, etc.) but each has quirks. Check yt-dlp --list-extractors and test before promising support.

Content-fuel library: /Claude/knowledge/intel/content-fuel/
Example capture: /Claude/knowledge/intel/content-fuel/2026-04-17-gtmnow-auren-hoffman-software-spend.md

Owner: cmo · Seeded: 2026-04-17