In Week 1 (extraction), Week 2 (embeddings + KMeans), and Week 3 (stable topics with BERTopic) I built the foundations. This week applies the same idea to claims — using BERTopic to cluster claim snippets and keep stable claim_ids via a registry + dim table.
This week we explore BERTopic + stable claim IDs:
- Use pre-computed embeddings from BigQuery (same pipeline as before).
- Fit/Load a BERTopic model (UMAP + HDBSCAN) in Python.
- Assign internal cluster IDs per batch, then map them to stable claim_ids.
- Persist to video_claims, claim_registry, and dim_claims tables for analysis.
- Inspect behavior in Looker Studio and reflect on limitations.
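As a rough illustration of the persistence step, the sketch below turns one batch of cluster assignments into rows for the three tables. Only the table names come from the list above. The function `build_rows`, every column name, and the `"unlabeled"` fallback are hypothetical, and the actual BigQuery load is left out.

```python
def build_rows(assignments, registry, labels):
    """Split one batch into rows for video_claims, claim_registry and dim_claims.

    assignments: dicts with video_id, snippet, model_version, internal_cluster_id
    registry:    (model_version, internal_cluster_id) -> stable claim_id
    labels:      claim_id -> human-readable label
    All field names here are assumptions made for this sketch.
    """
    video_claims = []
    seen = set()
    for a in assignments:
        # Translate the per-batch cluster ID into the stable claim_id.
        claim_id = registry[(a["model_version"], a["internal_cluster_id"])]
        video_claims.append(
            {"video_id": a["video_id"], "claim_id": claim_id, "snippet": a["snippet"]}
        )
        seen.add(claim_id)

    # One registry row per known (model_version, internal_cluster_id) pair.
    claim_registry = [
        {"model_version": mv, "internal_cluster_id": cid, "claim_id": claim_id}
        for (mv, cid), claim_id in sorted(registry.items())
    ]
    # One dimension row per claim that occurred in this batch.
    dim_claims = [
        {"claim_id": c, "label": labels.get(c, "unlabeled")} for c in sorted(seen)
    ]
    return video_claims, claim_registry, dim_claims


batch = [
    {"video_id": "vid1", "snippet": "claim text A", "model_version": "v1", "internal_cluster_id": 0},
    {"video_id": "vid2", "snippet": "claim text B", "model_version": "v1", "internal_cluster_id": 0},
]
video_claims, claim_registry, dim_claims = build_rows(
    batch, {("v1", 0): "claim_0001"}, {"claim_0001": "Example label"}
)
```

Because analysis joins only on `claim_id`, the internal cluster IDs can change between batches without affecting dashboards.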
 
    In bertopic, bigquery, clustering, embeddings, gcp, gemini, llm, looker-studio, machine-learning, python, research, topicwatchdog by DracoBlue @ 15 Oct 2025 | 3604 Words
 
    
    
    
        
        
        
In Week 1 (extraction) and Week 2 (embeddings + KMeans in BigQuery ML) we laid the groundwork. This week I built a Python BERTopic stage whose IDs stay stable across runs by mapping BERTopic’s internal clusters to stable topic IDs in BigQuery. I use Google Gemini again to generate nice labels for the extracted topic clusters.
This week we explore BERTopic + stable topic IDs (via an ID registry):
- Train a BERTopic model in Python (UMAP + HDBSCAN).
- Map BERTopic’s internal clusters (model_version, internal_topic_id) to stable topic IDs in BigQuery.
- Ensure topic IDs remain consistent across retraining (no more ID jumps).
- Join human-readable labels and persist results into video_topics for analysis.
- Inspect results in Looker Studio and reflect on limitations.
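One way such an ID registry could work is sketched below: each internal cluster is matched against the centroids of already-known topics and reuses a stable ID when it is similar enough. This is a minimal in-memory sketch and not necessarily the post's method. The class `TopicRegistry`, the cosine threshold of 0.85, and the `topic_0001` ID format are assumptions for illustration.

```python
from dataclasses import dataclass, field
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class TopicRegistry:
    threshold: float = 0.85                         # assumed similarity cut-off
    centroids: dict = field(default_factory=dict)   # stable_id -> centroid vector
    mapping: dict = field(default_factory=dict)     # (model_version, internal_topic_id) -> stable_id

    def resolve(self, model_version, internal_topic_id, centroid):
        if internal_topic_id == -1:
            return None  # BERTopic puts outliers into topic -1; they get no stable ID
        key = (model_version, internal_topic_id)
        if key in self.mapping:
            return self.mapping[key]  # already registered for this model version
        # Look for the most similar known topic at or above the threshold.
        best_id, best_sim = None, self.threshold
        for stable_id, known in self.centroids.items():
            sim = cosine(centroid, known)
            if sim >= best_sim:
                best_id, best_sim = stable_id, sim
        if best_id is None:
            # Nothing similar enough: mint a new stable ID.
            best_id = f"topic_{len(self.centroids) + 1:04d}"
            self.centroids[best_id] = list(centroid)
        self.mapping[key] = best_id
        return best_id


registry = TopicRegistry()
first = registry.resolve("v1", 0, [1.0, 0.0])
second = registry.resolve("v1", 1, [0.0, 1.0])
retrained = registry.resolve("v2", 7, [0.99, 0.05])  # close to the first topic
```

Here the retrained model's cluster 7 resolves to the same stable ID as cluster 0 of the first model, which is the "no more ID jumps" property. In a real pipeline the two dictionaries would live in a BigQuery table rather than in memory.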
 
    In bertopic, bigquery, clustering, embeddings, gcp, gemini, llm, looker-studio, machine-learning, python, research, topicwatchdog by DracoBlue @ 10 Sep 2025 | 3944 Words
 
    
    
    
        
        
        
This post documents Week 2 of the TopicWatchdog project.
Last week we successfully extracted topics and claims from German political short videos and persisted them in BigQuery.
However, topics often appeared under slightly different names — making aggregation unreliable.
This week we explore embeddings + clustering:
- Generate embeddings of canonical topics and claims with BigQuery ML.
- Train a KMeans model on those embeddings to group semantically similar entries.
- Assign clusters back to each topic/claim.
- Inspect first results in Looker Studio and reflect on limitations.
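To make the clustering step concrete, here is a tiny KMeans in plain Python. The post itself trains the model in BigQuery ML, so this is a swapped-in illustration of the same idea, not the pipeline's code. The toy 2-D points stand in for embeddings, and the deterministic initialisation from the first `k` points is a simplification for this sketch.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=10):
    # Deterministic init from the first k points (real implementations
    # usually use random or k-means++ initialisation).
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    # Assign every point back to its final cluster.
    return centroids, [nearest(p, centroids) for p in points]


embeddings = [[0, 0], [10, 10], [0, 1], [10, 11]]
centroids, labels = kmeans(embeddings, k=2)
```

The returned `labels` list is the "assign clusters back" step: entry `i` is the cluster index for input `i`.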
 
    In bigquery, clustering, embeddings, gcp, gemini, kmeans, llm, looker-studio, machine-learning, research, topicwatchdog by DracoBlue @ 03 Sep 2025 | 1963 Words
 
    
    
    
        
        
        
This post documents Week 1 of a research project I call TopicWatchdog: an end‑to‑end, reproducible pipeline that (a) collects German political short videos, (b) transcribes them, (c) extracts topics and claims with timestamps, and (d) persists everything in BigQuery for transparent, long‑term analysis.
The focus is on methods and reproducibility, not on polished production code. The snippets below are meant as scaffolding rather than finished code, but they are already enough to build a similar pipeline.
            
        
     
    In bigquery, gcp, gemini, llm, looker-studio, machine-learning, research, topicwatchdog, youtube by DracoBlue @ 27 Aug 2025 | 3495 Words
 
    
    
    
        
        
        
When working with Payload CMS, I sometimes need to check what is
in Payload's system collections.
Examples are payload-preferences and payload-migrations. Since 3.0 there are also
payload-jobs for the neat queue system and payload-locked-documents for document locking.
            
        
     
    In nodejs, payloadcms by DracoBlue @ 11 Jan 2025 | 278 Words