Merge Duplicates Dashboard

The Problem

Hundreds of duplicate beans and roasters had accumulated in the database — "Onyx Coffee Lab" vs "Onyx Coffee" vs "Onyx Coffee Labs," all pointing to the same roaster but treated as separate entities. Every duplicate fragments the recommendation engine's statistics: instead of one roaster with 200 shots of data driving high-confidence recommendations, you get three partial records with weaker signals. The same problem existed for beans — slight naming variations from user input, AI enhancement, and WordPress imports created a growing mess that degraded recommendation quality over time.

My Role

Solo builder. I identified the problem through the analytics dashboard — recommendation confidence scores were lower than expected for popular roasters, and digging in revealed the fragmentation. I designed the merge logic, built the duplicate detection engine, created the review UI, and implemented the full cascade across Firestore and WordPress. Nobody else touched this system.

The Approach

Duplicate Detection

Exact matching alone wouldn't catch the real duplicates. I implemented fuzzy string matching that combines Levenshtein distance with token similarity, which catches both typos ("Coffea Lab" vs "Coffee Lab") and reorderings ("Blue Bottle Coffee" vs "Coffee, Blue Bottle"). The algorithm scores candidate pairs and surfaces them ranked by match confidence, so the highest-probability duplicates get reviewed first.
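
A minimal sketch of this kind of scoring, in pure Python so it runs without dependencies. The function names, the use of Jaccard overlap for token similarity, taking the maximum of the two signals, and the 0.8 threshold are all illustrative assumptions, not the production logic (which uses the Python-Levenshtein package).

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(cleaned.split())


def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to 0..1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets, which ignores word order."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def match_score(name_a: str, name_b: str) -> float:
    """Assumed combination rule: keep whichever signal is stronger."""
    a, b = normalize(name_a), normalize(name_b)
    return max(edit_similarity(a, b), token_similarity(a, b))


def rank_candidates(names, threshold=0.8):
    """Score every pair and return those above threshold, best first."""
    scored = [(match_score(a, b), a, b) for a, b in combinations(names, 2)]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```

With these definitions, the edit-distance term handles the one-letter typo in "Coffea Lab", while the token term scores "Blue Bottle Coffee" against "Coffee, Blue Bottle" as a perfect match because word order is ignored. Comparing every pair is quadratic in the number of names, so a larger catalog would likely need some blocking step before scoring.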

Review with Context

Automated merging is too risky when you're touching a production database that feeds a recommendation engine. The dashboard presents duplicate candidates in a side-by-side comparison view with radio buttons for each field, so you pick the winning name, the winning origin, and the winning tasting notes independently. You see both records' full data before making any decision. Context matters: a roaster with 150 linked shots is clearly the "keeper" over a duplicate with 3.
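
The field-level selection can be sketched as a pure function that the UI layer would call once the reviewer has made their choices. In a Streamlit dashboard the choices dictionary would come from the radio widgets; here it is passed in directly. The function names and the shotCount field are hypothetical, chosen only for illustration.

```python
def suggest_keeper(record_a: dict, record_b: dict, count_field: str = "shotCount") -> str:
    """Suggest which side to keep: the record with more linked shots."""
    count_a = record_a.get(count_field, 0)
    count_b = record_b.get(count_field, 0)
    return "a" if count_a >= count_b else "b"


def build_merged_record(record_a: dict, record_b: dict,
                        choices: dict, default: str = "a") -> dict:
    """Assemble the merged record from per-field picks.

    choices maps a field name to "a" or "b"; fields the reviewer
    did not touch fall back to the default side.
    """
    sources = {"a": record_a, "b": record_b}
    merged = {}
    for field_name in sorted(set(record_a) | set(record_b)):
        pick = choices.get(field_name, default)
        if pick not in sources:
            raise ValueError(f"invalid choice {pick!r} for field {field_name!r}")
        merged[field_name] = sources[pick].get(field_name)
    return merged
```

Keeping this logic separate from the widgets means the merge result can be checked in a unit test without rendering the dashboard.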

Safe Execution

Every merge runs through a dry-run preview first: you see exactly which documents will be updated, moved, or archived before anything touches the database. Actual merges execute as batched Firestore writes, so each batch commits atomically (all of its writes apply or none do). Critically, all writes set updateSource: 'script', a marker that lets the 15+ Cloud Function triggers that would otherwise act on each document change skip their normal processing, preventing infinite trigger loops and unintended side effects.
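
A rough sketch of the dry-run and execute split. PlannedUpdate, preview, execute, and the chunk size are hypothetical names and values for illustration. The db argument is assumed to be a Firestore client from the Firebase Admin SDK; only its batch(), collection().document(), update(), and commit() calls are used.

```python
from dataclasses import dataclass

# Conservative chunk size (an assumption); tune to current Firestore limits.
MAX_BATCH_OPS = 500


@dataclass(frozen=True)
class PlannedUpdate:
    collection: str
    doc_id: str
    changes: dict


def preview(plan):
    """Dry run: describe every planned write without touching the database."""
    return [f"UPDATE {op.collection}/{op.doc_id}: {sorted(op.changes)}" for op in plan]


def execute(db, plan, chunk_size=MAX_BATCH_OPS):
    """Apply the plan in chunks; returns the number of batches committed.

    Each batch is atomic on its own, but atomicity does not span batches.
    """
    committed = 0
    for start in range(0, len(plan), chunk_size):
        batch = db.batch()
        for op in plan[start:start + chunk_size]:
            ref = db.collection(op.collection).document(op.doc_id)
            # Tag the write so trigger functions can recognize script writes.
            batch.update(ref, {**op.changes, "updateSource": "script"})
        batch.commit()
        committed += 1
    return committed
```

Because preview and execute consume the same plan, what the reviewer approves is exactly what gets written. A merge large enough to need several batches is only atomic per batch, so a failure partway through would need the audit log to resume or roll back.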

Cascade Management

A bean merge isn't just updating one document. It cascades across 6+ Firestore collections: bean-batches, shots, bean-batch-history, bean-reviews, bean-statsv2, and enhancement tracking records. Every reference to the "loser" bean must be repointed to the "winner." Miss one collection and you get orphaned data or broken references. The merge executor walks each collection systematically, updating foreign keys in batched writes.
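
The repointing step can be illustrated against an in-memory snapshot shaped like {collection: {doc_id: document}}. The beanId reference field and all function names are assumptions. The enhancement tracking collection is left out because its name is not given here, and bean-statsv2 may be keyed differently in practice.

```python
# Collections named above that reference a bean; order is the walk order.
CASCADE_COLLECTIONS = (
    "bean-batches",
    "shots",
    "bean-batch-history",
    "bean-reviews",
    "bean-statsv2",
)


def plan_repoint(snapshot, loser_id, winner_id, ref_field="beanId"):
    """List (collection, doc_id, changes) for every doc pointing at the loser."""
    plan = []
    for collection in CASCADE_COLLECTIONS:
        docs = snapshot.get(collection, {})
        for doc_id, doc in sorted(docs.items()):
            if doc.get(ref_field) == loser_id:
                plan.append((collection, doc_id, {ref_field: winner_id}))
    return plan


def apply_plan(snapshot, plan):
    """Apply planned changes to the in-memory snapshot."""
    for collection, doc_id, changes in plan:
        snapshot[collection][doc_id].update(changes)


def remaining_references(snapshot, loser_id, ref_field="beanId"):
    """Post-merge check: any document, in any collection, still pointing at the loser."""
    return [
        (collection, doc_id)
        for collection, docs in sorted(snapshot.items())
        for doc_id, doc in sorted(docs.items())
        if doc.get(ref_field) == loser_id
    ]
```

The final check scans every collection in the snapshot, not just the configured list, so a collection missing from CASCADE_COLLECTIONS shows up as a leftover reference instead of silently orphaning data.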

Cross-System Sync

Burrfect's data lives in both Firestore and WordPress (for SEO-driven discovery pages). Merging in Firestore alone would leave stale duplicates on the website. The merge executor queues WordPress REST API calls to update or remove the corresponding WordPress posts, keeping both systems consistent.
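
As an illustration, the queue can hold plain request descriptions that a worker sends later. This sketch assumes the pages are standard WordPress posts under the core /wp-json/wp/v2/posts route; a custom post type would use its own route. WpRequest, queue_merge_sync, and the example.com URL are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WpRequest:
    method: str
    url: str
    json: Optional[dict] = None


def queue_merge_sync(base_url, winner_post_id, loser_post_id, winner_fields):
    """Queue the WordPress calls that mirror a Firestore merge."""
    posts = f"{base_url.rstrip('/')}/wp-json/wp/v2/posts"
    queue = deque()
    # Update the surviving post with the merged field values.
    queue.append(WpRequest("POST", f"{posts}/{winner_post_id}", winner_fields))
    # DELETE without force=true moves the post to the trash (recoverable).
    queue.append(WpRequest("DELETE", f"{posts}/{loser_post_id}"))
    return queue


if __name__ == "__main__":
    for req in queue_merge_sync("https://example.com", 10, 11, {"title": "Merged bean"}):
        print(req.method, req.url)
```

Trashing the duplicate post by default fits the archive-not-delete approach on the Firestore side. Building the requests separately from sending them also lets a dry run list the planned WordPress calls alongside the Firestore writes.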

What I Built

Duplicate detection engine — Levenshtein + token similarity scoring, ranked candidate pairs, configurable thresholds
Side-by-side merge UI — Streamlit dashboard with radio buttons for field-level winner selection and full context display
Batched merge executor — dry-run preview, atomic Firestore batch writes, trigger bypass via updateSource: 'script'
6+ collection cascade handler — systematic foreign key updates across bean-batches, shots, bean-batch-history, bean-reviews, bean-statsv2, and enhancement tracking
WordPress sync queue — REST API integration to update or remove duplicate posts after Firestore merges
Audit logging — every merge action recorded with before/after state for full traceability
Archive system — loser records archived (never deleted) for recovery if a merge was wrong
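
The audit and archive pieces might look roughly like the following. This is a sketch with assumed field names (changedFields, mergedInto, archivedAt, and so on), not the stored schema.

```python
import copy
from datetime import datetime, timezone


def _timestamp(now=None):
    """ISO 8601 timestamp in UTC; now can be injected for testing."""
    return (now or datetime.now(timezone.utc)).isoformat()


def build_audit_entry(action, before, after, actor="script", now=None):
    """Record one merge action with full before/after state."""
    keys = set(before) | set(after)
    changed = sorted(k for k in keys if before.get(k) != after.get(k))
    return {
        "action": action,
        "actor": actor,
        "timestamp": _timestamp(now),
        "before": copy.deepcopy(before),  # copies, so later edits don't alter the log
        "after": copy.deepcopy(after),
        "changedFields": changed,
    }


def archive_loser(loser, winner_id, now=None):
    """Return an archived copy of the loser record instead of deleting it."""
    archived = copy.deepcopy(loser)
    archived.update({
        "archived": True,
        "mergedInto": winner_id,
        "archivedAt": _timestamp(now),
    })
    return archived
```

Storing the complete before state, along with a pointer from the archived record to its winner, is what makes it possible to investigate or reverse a merge later.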

The Result

Clean, deduplicated data feeding the recommendation engine. Roasters that were previously fragmented across multiple records now have their full shot history consolidated, producing higher-confidence recommendations. Every merge is fully auditable — I can trace exactly what changed, when, and why. And because losers are archived rather than deleted, any merge can be investigated or reversed if something looks wrong. The WordPress catalog stays in sync automatically, so users browsing the website see the same clean data as users in the app.

Tech Stack

Frontend: Streamlit (Python)
Backend: Firebase Admin SDK (Python), Firestore
Duplicate Detection: Python-Levenshtein, custom token similarity
Cross-System Sync: WordPress REST API
Safety: Dry-run preview, batched writes, updateSource: 'script' trigger bypass, archive-not-delete