In Week 1 (extraction) and Week 2 (embeddings + KMeans in BigQuery ML) we laid the groundwork. This week I built a Python BERTopic stage whose topic IDs stay stable across runs: BERTopic’s internal clusters are mapped to stable topic IDs stored in BigQuery. As before, I use Google Gemini to generate human-readable labels for the extracted topic clusters.
This week we explore BERTopic + stable topic IDs (via an ID registry):
- Train a BERTopic model in Python (UMAP + HDBSCAN).
- Map BERTopic’s internal clusters (`model_version`, `internal_topic_id`) to stable topic IDs.
- Ensure topic IDs remain consistent across retraining (no more ID jumps).
- Join human-readable labels and persist results into `video_topics` for analysis.
- Inspect results in Looker Studio and reflect on limitations.
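
To make the ID registry idea concrete before diving in, here is a minimal in-memory sketch in plain Python. It is not the pipeline's actual code: the `TopicRegistry` class, the `resolve` method, and the 0.5 threshold are all hypothetical, and matching a new cluster to an existing topic by Jaccard overlap of top keywords is my assumption for illustration. A real version would keep the registry in a BigQuery table rather than in dictionaries.

```python
from dataclasses import dataclass, field


def jaccard(a, b):
    """Jaccard overlap between two keyword collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class TopicRegistry:
    """Hypothetical registry: (model_version, internal_topic_id) -> stable topic ID."""
    threshold: float = 0.5  # assumed cut-off for "same topic as before"
    next_id: int = 1
    keywords: dict = field(default_factory=dict)  # stable_id -> latest keyword set
    mapping: dict = field(default_factory=dict)   # (version, internal_id) -> stable_id

    def resolve(self, model_version, internal_topic_id, top_words):
        key = (model_version, internal_topic_id)
        if key in self.mapping:
            # This exact cluster was registered before: return its stable ID.
            return self.mapping[key]

        # Find the known stable topic whose keywords overlap most.
        best_id, best_score = None, 0.0
        for stable_id, words in self.keywords.items():
            score = jaccard(words, top_words)
            if score > best_score:
                best_id, best_score = stable_id, score

        # No sufficiently similar topic: mint a fresh stable ID.
        if best_id is None or best_score < self.threshold:
            best_id = self.next_id
            self.next_id += 1

        self.keywords[best_id] = set(top_words)
        self.mapping[key] = best_id
        return best_id


registry = TopicRegistry()

# First training run ("v1"): every cluster gets a new stable ID.
a = registry.resolve("v1", 0, ["python", "pandas", "dataframe", "tutorial"])
b = registry.resolve("v1", 1, ["guitar", "chords", "lesson", "acoustic"])

# Retraining ("v2"): internal IDs are shuffled, but the pandas cluster
# overlaps enough with its v1 counterpart to keep the same stable ID.
c = registry.resolve("v2", 3, ["pandas", "python", "dataframe", "numpy"])
d = registry.resolve("v2", 0, ["sourdough", "bread", "baking", "starter"])
```

Downstream tables only ever see the stable ID, so a dashboard filter on a topic keeps pointing at the same topic after retraining. Note that this greedy sketch does not stop two clusters from the same model version landing on one stable ID; a more careful registry would enforce a one-to-one match within each version.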