dracoblue.net

Week 3: Stable Topics with BERTopic

In Week 1 (extraction) and Week 2 (embeddings + KMeans in BigQuery ML) we laid the groundwork. This week I built a Python BERTopic stage whose topic IDs stay stable across runs by mapping BERTopic’s internal clusters to stable topic IDs in BigQuery. I again use Google Gemini to generate human-readable labels for the extracted topic clusters.

This week we explore BERTopic + stable topic IDs (via an ID registry):

  • Train a BERTopic model in Python (UMAP + HDBSCAN).
  • Map BERTopic’s internal clusters (model_version, internal_topic_id) to stable topic IDs in BigQuery.
  • Ensure topic IDs remain consistent across retraining (no more ID jumps).
  • Join human-readable labels and persist results into video_topics for analysis.
  • Inspect results in Looker Studio and reflect on limitations.
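
The idea of an ID registry can be illustrated with a minimal sketch. It is not the implementation from the post: the post keeps its mapping in BigQuery, while this example uses an in-memory stand-in, and the matching rule (keyword overlap between a new cluster and already known topics) is purely an assumption for illustration. The names `TopicRegistry`, `resolve`, and `min_overlap` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TopicRegistry:
    """In-memory stand-in for a registry table (hypothetical, for illustration)."""

    # (model_version, internal_topic_id) -> stable topic ID
    mapping: dict = field(default_factory=dict)
    # stable topic ID -> keywords seen when the ID was first allocated
    keywords: dict = field(default_factory=dict)
    next_id: int = 1

    def resolve(self, model_version, internal_topic_id, topic_keywords, min_overlap=0.5):
        """Return the stable ID for one internal cluster of one model version."""
        key = (model_version, internal_topic_id)
        if key in self.mapping:
            # Already registered: the same pair always yields the same stable ID.
            return self.mapping[key]

        # Assumption: a retrained cluster is matched to a known topic by
        # Jaccard overlap of its keywords. The real pipeline may match differently.
        words = set(topic_keywords)
        best_id, best_score = None, 0.0
        for stable_id, known in self.keywords.items():
            union = words | known
            score = len(words & known) / len(union) if union else 0.0
            if score > best_score:
                best_id, best_score = stable_id, score

        if best_id is None or best_score < min_overlap:
            # No sufficiently similar topic: allocate a fresh stable ID.
            best_id = self.next_id
            self.next_id += 1
            self.keywords[best_id] = words

        self.mapping[key] = best_id
        return best_id


registry = TopicRegistry()
pension_v1 = registry.resolve("v1", 0, ["rente", "alter", "beitrag"])
climate_v1 = registry.resolve("v1", 1, ["klima", "energie", "co2"])
# After retraining, BERTopic may number the same cluster differently.
pension_v2 = registry.resolve("v2", 7, ["rente", "beitrag", "alter", "pension"])
print(pension_v1, climate_v1, pension_v2)
```

In this sketch the retrained cluster `("v2", 7)` resolves to the same stable ID as `("v1", 0)` because most of its keywords overlap, while an unrelated cluster would receive a new ID.
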
Continue reading ...

In bertopic, bigquery, clustering, embeddings, gcp, gemini, llm, looker-studio, machine-learning, python, research, topicwatchdog by DracoBlue @ 10 Sep 2025 | 3944 Words

Week 2: Embeddings & KMeans Clustering of Topics/Claims

This post documents Week 2 of the TopicWatchdog project.
Last week we successfully extracted topics and claims from German political short videos and persisted them in BigQuery.
However, topics often appeared under slightly different names — making aggregation unreliable.

This week we explore embeddings + clustering:

  • Generate embeddings of canonical topics and claims with BigQuery ML.
  • Train a KMeans model on those embeddings to group semantically similar entries.
  • Assign clusters back to each topic/claim.
  • Inspect first results in Looker Studio and reflect on limitations.
Continue reading ...

In bigquery, clustering, embeddings, gcp, gemini, kmeans, llm, looker-studio, machine-learning, research, topicwatchdog by DracoBlue @ 03 Sep 2025 | 1963 Words

Kickoff (Week 1): Extracting Topics & Claims from German Politics Videos

This post documents Week 1 of a research project I call TopicWatchdog: an end‑to‑end, reproducible pipeline that (a) collects German political short videos, (b) transcribes them, (c) extracts topics and claims with timestamps, and (d) persists everything in BigQuery for transparent, long‑term analysis.

The focus is on methods and reproducibility, not on polished production code. The snippets below are meant as scaffolding for guidance, but they already allow you to build a similar pipeline.

Continue reading ...

In bigquery, gcp, gemini, llm, looker-studio, machine-learning, research, topicwatchdog, youtube by DracoBlue @ 27 Aug 2025 | 3495 Words

Show System Collections in Payload CMS

When working with Payload CMS, I sometimes need to check what is in Payload’s system collections.

Examples are payload-preferences and payload-migrations. Since 3.0 there are also payload-jobs for the neat queue system and payload-locked-documents for document locking.

Continue reading ...

In nodejs, payloadcms by DracoBlue @ 11 Jan 2025 | 278 Words

Debugging Directus Serverside Extensions

The Directus docs have good documentation on how to run a build of Directus in Docker. For developing and contributing to Directus itself, there is also good documentation on running Directus locally.

But if you want to develop a Directus endpoint extension locally, you might want to use the "breakpoint" feature of your IDE (e.g. VS Code) without having to run the entire Directus stack with pnpm in development mode.

Continue reading ...

In directus, nodejs, vscode, vue by DracoBlue @ 19 Aug 2023 | 660 Words
