yt-dlp
Purpose
CLI utility for pulling metadata and auto-generated captions from YouTube (and other platforms) without downloading video files. Primary use: extracting transcripts from YouTube videos and podcast episodes for the content-fuel library at /Claude/knowledge/intel/content-fuel/.
How we use it
- Transcript extraction for content fuel. When Rick or anyone on the team sends a YouTube link as a "good piece, worth reading," the CMO pulls the auto-captions, cleans them, and files the piece in /Claude/knowledge/intel/content-fuel/ with a frontmatter block and a "Takeaways for eco|monetize" summary at the top.
- Podcast episode transcription when the episode is hosted on YouTube (many GTM/VC podcasts dual-publish).
- Metadata capture (title, duration, upload date, description, view count) via --write-info-json.
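As a rough illustration of the metadata capture, the fields named above can be read back from the *.info.json file. This is a minimal sketch: summarize_info and load_info are hypothetical helpers, and the key names (title, duration, upload_date, description, view_count) are the ones yt-dlp normally writes, though any of them may be absent for a given video.

```python
import json
from pathlib import Path

# Fields this page lists under "Metadata capture"; key names follow yt-dlp's info.json.
FIELDS = ("title", "duration", "upload_date", "description", "view_count")


def summarize_info(info: dict) -> dict:
    """Return only the fields of interest, with None for anything missing."""
    return {key: info.get(key) for key in FIELDS}


def load_info(path: str) -> dict:
    """Read an info.json file written by --write-info-json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Example with an inline stand-in for a real info.json payload.
sample = {"title": "Example", "duration": 3600, "upload_date": "20260417", "extra": 1}
summary = summarize_info(sample)
```

Using info.get rather than indexing keeps the helper from failing on videos where a field such as view_count is not reported.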
Install
Installed on Rick's Mac 2026-04-17. No auth or credentials required for public videos.
Canonical commands
Pull auto-captions + metadata (no video download):
yt-dlp --write-auto-sub --write-sub --sub-lang en --skip-download \
  --sub-format vtt --write-info-json \
  -o "youtube-%(id)s-%(title)s.%(ext)s" \
  "https://youtu.be/<VIDEO_ID>"
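Produces:
- *.en.vtt: English captions in VTT format. With both --write-sub and --write-auto-sub set, yt-dlp writes the uploader's manual track when one exists and falls back to the auto-generated captions otherwise (the usual case).
- *.info.json: metadata (title, duration, description, channel, etc.)
- *.en.srt: not expected from this command as written. --sub-format vtt applies to manual subtitles too, so a manual track (often absent) would also arrive as *.en.vtt; SRT output needs a conversion step such as --convert-subs srt.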
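If the pull needs to be scripted, the canonical command can be assembled as an argument list and handed to subprocess. A minimal sketch, assuming yt-dlp is on the PATH; build_caption_args and pull_captions are hypothetical helper names, and the flags are exactly those in the command above.

```python
import subprocess


def build_caption_args(
    url: str, out_template: str = "youtube-%(id)s-%(title)s.%(ext)s"
) -> list[str]:
    """Assemble the captions + metadata command as an argv list."""
    return [
        "yt-dlp",
        "--write-auto-sub", "--write-sub",
        "--sub-lang", "en",
        "--skip-download",
        "--sub-format", "vtt",
        "--write-info-json",
        "-o", out_template,
        url,
    ]


def pull_captions(url: str, workdir: str = ".") -> int:
    """Run yt-dlp inside workdir and return its exit code (0 on success)."""
    result = subprocess.run(build_caption_args(url), cwd=workdir, check=False)
    return result.returncode


# Building the arguments does not touch the network; only pull_captions does.
args = build_caption_args("https://youtu.be/<VIDEO_ID>")
```

Passing a list instead of a shell string avoids quoting problems with the output template and with URLs that contain shell metacharacters.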
Check available subtitle tracks before downloading:
yt-dlp --list-subs "https://youtu.be/<VIDEO_ID>"
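The track check can also be made from metadata rather than by reading terminal output. A sketch, assuming the info dict comes from an info.json file, where yt-dlp lists manual tracks under subtitles and auto-generated ones under automatic_captions, both keyed by language code. caption_sources is a hypothetical helper.

```python
def caption_sources(info: dict, lang: str = "en") -> dict:
    """Report whether manual and/or auto-generated tracks exist for lang."""

    def has_lang(tracks: dict) -> bool:
        # Accept exact matches ("en") and regional variants ("en-US").
        return any(key == lang or key.startswith(lang + "-") for key in tracks)

    return {
        "manual": has_lang(info.get("subtitles") or {}),
        "auto": has_lang(info.get("automatic_captions") or {}),
    }


# Inline stand-in for a video that only has auto-generated English captions.
sample = {"subtitles": {}, "automatic_captions": {"en": [{"ext": "vtt"}]}}
sources = caption_sources(sample)
```

A result with manual set to False signals that the transcript will carry the auto-caption quality issues listed under Known limitations.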
Cleaning VTT → markdown
Auto-caption VTT files are noisy: rolling captions duplicate text, timestamps are embedded inline, and positional tags litter the content. Python pattern used in the Auren Hoffman transcript capture (2026-04-17):
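import re

# vtt_path: path to the .en.vtt file written by yt-dlp
with open(vtt_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

seen = set()
out = []
for ln in lines:
    if not ln.strip():
        continue
    # Skip the VTT header block
    if ln.startswith(("WEBVTT", "Kind:", "Language:")):
        continue
    # Skip cue timing lines
    if "-->" in ln:
        continue
    # Strip inline timestamp and positional tags such as <c> or <00:00:01.000>
    ln = re.sub(r"<[^>]+>", "", ln).strip()
    # Drop lines already emitted (rolling captions repeat each line);
    # note this also drops genuine repeats later in the transcript
    if not ln or ln in seen:
        continue
    seen.add(ln)
    out.append(ln)

text = re.sub(r"\s+", " ", " ".join(out)).strip()
# Then paragraph-break every ~800 chars at sentence boundaries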
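The paragraph-breaking step named in the final comment is not shown in the snippet. One possible sketch, assuming sentences end in ., ! or ? followed by whitespace (auto-captions often lack punctuation, in which case the text stays in one block). The function name paragraph_break is illustrative, and the 800-character default simply mirrors the comment.

```python
import re


def paragraph_break(text: str, target: int = 800) -> str:
    """Group sentences into paragraphs of roughly `target` characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    paragraphs, current, length = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        current.append(sentence)
        length += len(sentence) + 1
        # Close the paragraph once it reaches the target size.
        if length >= target:
            paragraphs.append(" ".join(current))
            current, length = [], 0
    # Flush whatever is left as the final paragraph.
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


# Tiny target so the grouping is visible on a short input.
demo = paragraph_break("One. Two. Three. Four.", target=10)
```

Because a paragraph only closes at a sentence boundary, paragraphs can run somewhat past the target rather than being cut mid-sentence.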
Workflow for content-fuel captures
- Pull VTT + info.json via yt-dlp (commands above)
- Clean VTT to plain text, paragraph-break for readability
- Load info.json for title, channel, duration, description
- Create /Claude/knowledge/intel/content-fuel/YYYY-MM-DD-{source}-{slug}.md with:
  - Frontmatter per /Claude/knowledge/intel/content-fuel/README.md
  - "Takeaways for eco|monetize" section mined by CMO
  - "Suggested content pipeline" table routing to Substack/LinkedIn/blog
  - Raw transcript appended at the bottom
- Keep .raw.vtt and .info.json as sibling artifacts for auditability
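The file-creation step above can be sketched as follows. This is an illustration only: slugify, capture_path, and render_capture are hypothetical helpers, the date is assumed to be the capture date, and the frontmatter keys and section headings are placeholders, since the real schema lives in the content-fuel README.

```python
import re
from datetime import date


def slugify(text: str) -> str:
    """Lowercase the text and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def capture_path(source: str, slug: str, day: date) -> str:
    """Build the YYYY-MM-DD-{source}-{slug}.md path used by the library."""
    base = "/Claude/knowledge/intel/content-fuel/"
    return f"{base}{day.isoformat()}-{slugify(source)}-{slugify(slug)}.md"


def render_capture(info: dict, transcript: str) -> str:
    """Assemble the sections in the order listed above (placeholder frontmatter)."""
    frontmatter = "\n".join([
        "---",
        f"title: {info.get('title', '')}",      # placeholder key, not the README schema
        f"channel: {info.get('channel', '')}",  # placeholder key, not the README schema
        "---",
    ])
    return "\n\n".join([
        frontmatter,
        "## Takeaways for eco|monetize",   # filled in by the CMO
        "## Suggested content pipeline",   # routing table goes here
        "## Raw transcript",
        transcript,
    ])


path = capture_path("GTMnow", "Auren Hoffman software spend", date(2026, 4, 17))
```

With these inputs the path matches the example capture listed under Related files, which is a quick way to sanity-check the naming convention.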
Known limitations
- Auto-captions only unless the uploader has published manual subtitles. Auto-captions miss speaker attribution and garble technical terms and names.
- ffmpeg missing warning: yt-dlp will warn but still succeed for captions-only extraction. Install ffmpeg (brew install ffmpeg) only if we move into audio/video download use cases.
- No speaker diarization. For podcast transcripts where speaker attribution matters, cross-reference the cleaned transcript against the original video or use a diarization service downstream.
- Impersonation warning: benign for public-video caption pulls, but future platform changes may require installing impersonation targets. Monitor if caption pulls start failing.
When NOT to use yt-dlp
- Private meeting recordings: use the native meeting platform transcription (Sembly, Granola, Google Meet) instead. yt-dlp is for public, published media only.
- Audio-only files (.m4a, .wav, .mp3) stored locally: route to mining.mind for local audio transcription via Whisper or equivalent.
- Platforms outside YouTube: yt-dlp supports many platforms (Vimeo, SoundCloud, etc.) but each has quirks. Check yt-dlp --list-extractors and test before promising support.
Related files
- Content-fuel library: /Claude/knowledge/intel/content-fuel/
- Example capture: /Claude/knowledge/intel/content-fuel/2026-04-17-gtmnow-auren-hoffman-software-spend.md
Owner: cmo · Seeded: 2026-04-17