Synthetic Training Data Factory
The factory generates labeled synthetic data (meeting transcripts, CRM records, signal extractions) for two purposes: (1) training and fine-tuning signal extraction models, and (2) populating demo environments with realistic data that contains no customer information.
Why Synthetic Data
- No real customer data needed for model training or demo scenarios
- Covers edge cases and rare signal types underrepresented in real data
- Controllable: generate specific scenarios (competitive mention, executive sponsor drop-off, budget freeze) on demand
- Safe to share with prospects during demos — no data privacy concerns
Outputs
| Artifact | Use |
|---|---|
| Synthetic meeting transcripts | Signal extraction training + demo |
| Synthetic CRM opportunity records | ERI demo data (Signal Capture, EcoTasks) |
| Labeled signal extractions | Model fine-tuning labeled dataset |
| Scenario packages | Specific demo storylines (healthcare CRO, tech VP Sales) |
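To make the "labeled signal extractions" artifact more concrete, here is a minimal sketch of how one labeled example could be stored as a JSON Lines record for fine-tuning. The record shape (`LabeledExample`, `transcript_id`, `scenario`, `signals`) and the helper names are assumptions made for illustration; the actual dataset schema is not specified on this page.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LabeledExample:
    """Hypothetical record shape for one labeled synthetic transcript."""
    transcript_id: str
    scenario: str  # e.g. a scenario package name (assumed field)
    text: str
    # Each signal is a dict with "type", "start", "end" (character offsets).
    signals: list = field(default_factory=list)


def to_jsonl(examples):
    """Serialize examples to JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


def from_jsonl(payload):
    """Parse JSON Lines back into LabeledExample records."""
    return [
        LabeledExample(**json.loads(line))
        for line in payload.splitlines()
        if line.strip()
    ]


example = LabeledExample(
    transcript_id="demo-001",
    scenario="healthcare CRO",
    text="Buyer: Finance has frozen all new spend.",
    signals=[{"type": "budget_freeze", "start": 7, "end": 40}],
)
restored = from_jsonl(to_jsonl([example]))
```

Keeping character offsets next to the text lets a consumer check every annotation by slicing `text[start:end]`, which is one simple way to catch label drift before training.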
Components
- Synthetic Meeting Transcript Generator — LLM-generated meeting transcripts with injected signals
- Scenario Engine — Parameterized scenario templates (ICP vertical, deal stage, signal density)
- Labeling Pipeline — Auto-labels generated transcripts with ground-truth signal annotations
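The three components above can be sketched together as a toy pipeline: a parameterized scenario template drives transcript generation, signals are injected at chosen turns, and ground-truth labels are recorded at injection time. This is an illustrative sketch under stated assumptions, not the actual implementation. Canned phrases stand in for LLM generation, `signal_density` is assumed to mean the fraction of turns that carry a signal, and all names (`ScenarioTemplate`, `SignalLabel`, `generate_labeled_transcript`, the signal type keys) are hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical signal phrases; a real generator would ask an LLM to write these.
SIGNAL_PHRASES = {
    "competitive_mention": "We are also evaluating another vendor this quarter.",
    "budget_freeze": "Finance has frozen all new spend until next fiscal year.",
    "executive_sponsor_drop_off": "Our VP will not be joining these calls anymore.",
}

FILLER_LINES = [
    "Thanks for joining today.",
    "Let's review the open action items.",
    "Can you share the updated timeline?",
    "We will follow up by email.",
]


@dataclass(frozen=True)
class ScenarioTemplate:
    icp_vertical: str
    deal_stage: str
    signal_density: float  # assumed: fraction of turns that carry a signal
    turns: int = 10


@dataclass(frozen=True)
class SignalLabel:
    signal_type: str
    start: int  # character offset into the transcript, inclusive
    end: int    # character offset into the transcript, exclusive


def generate_labeled_transcript(template, seed=0):
    """Return (transcript, labels); labels are recorded as signals are injected."""
    rng = random.Random(seed)
    n_signals = min(template.turns, round(template.turns * template.signal_density))
    signal_turns = set(rng.sample(range(template.turns), n_signals))

    lines, labels, offset = [], [], 0
    for turn in range(template.turns):
        prefix = "Rep: " if turn % 2 == 0 else "Buyer: "
        if turn in signal_turns:
            signal_type = rng.choice(sorted(SIGNAL_PHRASES))
            utterance = SIGNAL_PHRASES[signal_type]
            start = offset + len(prefix)
            labels.append(SignalLabel(signal_type, start, start + len(utterance)))
        else:
            utterance = rng.choice(FILLER_LINES)
        line = prefix + utterance
        lines.append(line)
        offset += len(line) + 1  # +1 accounts for the joining newline
    return "\n".join(lines), labels


template = ScenarioTemplate("healthcare", "negotiation", signal_density=0.3)
transcript, labels = generate_labeled_transcript(template, seed=7)
```

Recording labels at injection time means the annotations are correct by construction, so no separate model is needed to recover them. The fixed seed makes a scenario reproducible, which is useful when a demo storyline has to be regenerated.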
Current State (2026-04-23)
- Synthetic Meeting Transcript Generator in active development (Epic: 11507545493)
- Feeds ERI demo data quality recovery (target 40+ Signal Capture records)
- EcoTask synthetic records also in scope for demo readiness
Related
- Signal Pipeline: consumer of the training data
- Meeting Intelligence: synthetic transcripts test the meeting intelligence layer
- Voice Transcription Pipeline: synthetic audio can test the Whisper pipeline
- ERI Demo Data Quality Mission