Merge Duplicates Dashboard
Live
An admin tool for deduplicating beans and roasters across Firestore and WordPress — because the AI recommendation engine's accuracy depends on clean, unfragmented data.
The Problem
Hundreds of duplicate beans and roasters had accumulated in the database — "Onyx Coffee Lab" vs "Onyx Coffee" vs "Onyx Coffee Labs," all pointing to the same roaster but treated as separate entities. Every duplicate fragments the recommendation engine's statistics: instead of one roaster with 200 shots of data driving high-confidence recommendations, you get three partial records with weaker signals. The same problem existed for beans — slight naming variations from user input, AI enhancement, and WordPress imports created a growing mess that degraded recommendation quality over time.
My Role
Solo builder. I identified the problem through the analytics dashboard — recommendation confidence scores were lower than expected for popular roasters, and digging in revealed the fragmentation. I designed the merge logic, built the duplicate detection engine, created the review UI, and implemented the full cascade across Firestore and WordPress. Nobody else touched this system.
The Approach
Duplicate Detection
Exact matching wouldn't catch the real duplicates. I implemented fuzzy string matching that combines Levenshtein distance with token similarity, which catches both typos ("Coffea Lab" vs "Coffee Lab") and reorderings ("Blue Bottle Coffee" vs "Coffee, Blue Bottle"). The algorithm scores candidate pairs and surfaces them ranked by match confidence, so the highest-probability duplicates get reviewed first.
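A minimal sketch of how this kind of scoring might look. It is an illustration under stated assumptions, not the dashboard's actual code: the edit distance is written in pure Python so the example is self-contained (the stack lists Python-Levenshtein for this), token similarity is taken to be Jaccard overlap of word sets, and the rule of combining the two signals by taking the maximum is an assumption. The function names and the 0.8 default threshold are hypothetical.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", name.lower()).split())


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; insensitive to word order."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def match_score(a: str, b: str) -> float:
    """Take the stronger of the two signals (an assumed combination rule)."""
    na, nb = normalize(a), normalize(b)
    return max(edit_similarity(na, nb), token_similarity(na, nb))


def rank_candidates(names, threshold=0.8):
    """Score every pair and return those at or above the threshold, best first."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = match_score(names[i], names[j])
            if score >= threshold:
                pairs.append((score, names[i], names[j]))
    return sorted(pairs, key=lambda p: p[0], reverse=True)
```

In this sketch the typo pair "Coffea Lab" / "Coffee Lab" scores 0.9 through edit similarity, while the reordered pair "Blue Bottle Coffee" / "Coffee, Blue Bottle" scores 1.0 through token overlap. Comparing every pair is quadratic in the number of names, which is acceptable for a catalog reviewed by hand but would need blocking or indexing at larger scale.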
Review with Context
Automated merging is too risky when you're touching a production database that feeds a recommendation engine. The dashboard presents duplicate candidates in a side-by-side comparison view with radio buttons for each field — pick the winner's name, the winner's origin, the winner's tasting notes. You see both records' full data before making any decision. Context matters: a roaster with 150 linked shots is clearly the "keeper" over a duplicate with 3.
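The field-level selection described above can be reduced to a small pure function. This is a hypothetical sketch: the function names, the `"a"`/`"b"` choice encoding, the `linkedShots` field, and the default of falling back to record A are all assumptions made for illustration.

```python
def build_merged_record(record_a, record_b, choices):
    """Combine two records using a per-field choice of "a" or "b".

    `choices` maps field name -> "a" or "b". Fields without an explicit
    choice default to record A (an assumption made for this sketch).
    """
    merged = {}
    for field in sorted(set(record_a) | set(record_b)):
        pick = choices.get(field, "a")
        if pick not in ("a", "b"):
            raise ValueError(f"invalid choice for {field!r}: {pick!r}")
        source = record_a if pick == "a" else record_b
        merged[field] = source.get(field)
    return merged


def suggest_keeper(record_a, record_b, count_field="linkedShots"):
    """Suggest the record with more linked shots as the default keeper."""
    count_a = record_a.get(count_field, 0)
    count_b = record_b.get(count_field, 0)
    return "a" if count_a >= count_b else "b"
```

In a Streamlit UI, each radio button would write one entry into `choices`, and `suggest_keeper` would only pre-select a default; the reviewer still makes the final call.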
Safe Execution
Every merge runs through a dry-run preview first — you see exactly which documents will be updated, moved, or archived before anything touches the database. Actual merges execute as batched Firestore writes for atomicity. Critically, all writes use updateSource: 'script' to safely bypass the 15+ Cloud Function triggers that would otherwise fire on each document change, preventing infinite trigger loops and unintended side effects.
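A rough sketch of the dry-run-then-commit flow, under several assumptions. The helper names are hypothetical, `db` is expected to behave like a `google-cloud-firestore` `Client` (its `document()` and `batch()` methods, and the batch's `update()` and `commit()`), and the Cloud Function triggers are assumed to check the `updateSource` field themselves and return early. The chunk size of 500 reflects Firestore's long-standing per-batch write cap and is kept here as a conservative default.

```python
from itertools import islice

MAX_BATCH_OPS = 500  # conservative chunk size (assumed, see lead-in)


def plan_updates(doc_paths, fields):
    """Build the list of writes without touching the database (the dry run)."""
    # Every write carries the marker that the triggers are assumed to check.
    payload = {**fields, "updateSource": "script"}
    return [{"path": path, "data": dict(payload)} for path in doc_paths]


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def execute_plan(db, plan, dry_run=True, batch_size=MAX_BATCH_OPS):
    """Print the plan, and commit it in batches only when dry_run is False.

    Returns the number of writes actually committed (0 for a dry run).
    """
    committed = 0
    for ops in chunked(plan, batch_size):
        if dry_run:
            for op in ops:
                print(f"[dry-run] update {op['path']} -> {op['data']}")
            continue
        batch = db.batch()
        for op in ops:
            batch.update(db.document(op["path"]), op["data"])
        batch.commit()  # atomic for this batch only
        committed += len(ops)
    return committed
```

One design consequence worth noting: atomicity holds per batch, not across batches, so a merge that spans several batches benefits from the audit log and archive to recover from a failure partway through.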
Cascade Management
A bean merge isn't just updating one document. It cascades across 6+ Firestore collections: bean-batches, shots, bean-batch-history, bean-reviews, bean-statsv2, and enhancement tracking records. Every reference to the "loser" bean must be repointed to the "winner." Miss one collection and you get orphaned data or broken references. The merge executor walks each collection systematically, updating foreign keys in batched writes.
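The cascade can be illustrated against an in-memory snapshot. This is a simplified sketch, not the actual merge executor: the foreign-key field name `beanId` is hypothetical, the snapshot stands in for one Firestore query per collection, and the enhancement tracking collection is omitted because its name is not given above.

```python
CASCADE_COLLECTIONS = [
    "bean-batches",
    "shots",
    "bean-batch-history",
    "bean-reviews",
    "bean-statsv2",
]  # the enhancement tracking collection name is not given in the text


def find_references(snapshot, loser_id, fk_field="beanId"):
    """Return paths of every document that still points at the loser bean.

    `snapshot` maps collection name -> {doc_id: doc_dict}.
    """
    paths = []
    for collection in CASCADE_COLLECTIONS:
        for doc_id, doc in snapshot.get(collection, {}).items():
            if doc.get(fk_field) == loser_id:
                paths.append(f"{collection}/{doc_id}")
    return paths


def repoint(snapshot, loser_id, winner_id, fk_field="beanId"):
    """Repoint loser references to the winner; return the touched paths."""
    touched = find_references(snapshot, loser_id, fk_field)
    for path in touched:
        collection, doc_id = path.split("/", 1)
        snapshot[collection][doc_id][fk_field] = winner_id
    return touched
```

Running `find_references` again after `repoint` is a cheap post-merge check: an empty list means no collection in the walk still references the loser.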
Cross-System Sync
Burrfect's data lives in both Firestore and WordPress (for SEO-driven discovery pages). Merging in Firestore alone would leave stale duplicates on the website. The merge executor queues WordPress REST API calls to update or remove the corresponding WordPress posts, keeping both systems consistent.
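A sketch of what queuing the WordPress side might look like. The requests are only described as dictionaries, nothing is sent, and authentication is left out. The function names are hypothetical, and the `/wp-json/wp/v2/posts/<id>` route is the core posts endpoint; if the catalog uses a custom post type, its route would differ, which is an assumption this sketch does not resolve.

```python
from collections import deque


def wp_request(action, post_id, base_url, fields=None):
    """Describe one WordPress REST call as a plain dict (nothing is sent)."""
    url = f"{base_url.rstrip('/')}/wp-json/wp/v2/posts/{post_id}"
    if action == "update":
        return {"method": "POST", "url": url, "json": fields or {}}
    if action == "remove":
        # Without force=true, WordPress moves the post to the trash
        # instead of deleting it permanently.
        return {"method": "DELETE", "url": url, "json": None}
    raise ValueError(f"unknown action: {action}")


def queue_merge_sync(queue, winner_post_id, loser_post_id, winner_fields, base_url):
    """Queue the update for the winner's post and the removal of the loser's."""
    queue.append(wp_request("update", winner_post_id, base_url, winner_fields))
    queue.append(wp_request("remove", loser_post_id, base_url))
    return queue
```

Queuing descriptors rather than calling the API inline keeps the Firestore merge independent of WordPress availability; a separate worker can drain the queue and retry failed calls.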
What I Built
- Duplicate detection engine — Levenshtein + token similarity scoring, ranked candidate pairs, configurable thresholds
- Side-by-side merge UI — Streamlit dashboard with radio buttons for field-level winner selection and full context display
- Batched merge executor — dry-run preview, atomic Firestore batch writes, trigger bypass via updateSource: 'script'
- 6+ collection cascade handler — systematic foreign key updates across bean-batches, shots, bean-batch-history, bean-reviews, bean-statsv2, and enhancement tracking
- WordPress sync queue — REST API integration to update or remove duplicate posts after Firestore merges
- Audit logging — every merge action recorded with before/after state for full traceability
- Archive system — loser records archived (never deleted) for recovery if a merge was wrong
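The audit and archive items above could be modeled roughly as follows. This is an illustrative sketch only: the record shapes, field names such as `mergedInto` and `originalId`, and the `merge-dashboard` actor label are hypothetical.

```python
import copy
from datetime import datetime, timezone


def audit_entry(action, doc_path, before, after, actor="merge-dashboard"):
    """Build one audit record holding the state before and after a change."""
    return {
        "action": action,
        "path": doc_path,
        "before": copy.deepcopy(before),  # copies, so later edits don't leak in
        "after": copy.deepcopy(after),
        "actor": actor,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def archive_loser(loser_id, loser_doc, winner_id):
    """Return the archive document for a loser instead of deleting it."""
    archived = copy.deepcopy(loser_doc)
    archived.update({
        "archived": True,
        "mergedInto": winner_id,
        "originalId": loser_id,
    })
    return archived


def changed_fields(entry):
    """List the fields whose values differ between before and after."""
    keys = set(entry["before"]) | set(entry["after"])
    return sorted(k for k in keys
                  if entry["before"].get(k) != entry["after"].get(k))
```

Storing full before and after snapshots costs more space than storing a diff, but it makes reversing a bad merge a matter of writing the `before` state back.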
The Result
Clean, deduplicated data feeding the recommendation engine. Roasters that were previously fragmented across multiple records now have their full shot history consolidated, producing higher-confidence recommendations. Every merge is fully auditable — I can trace exactly what changed, when, and why. And because losers are archived rather than deleted, any merge can be investigated or reversed if something looks wrong. The WordPress catalog stays in sync automatically, so users browsing the website see the same clean data as users in the app.
Tech Stack
- Frontend: Streamlit (Python)
- Backend: Firebase Admin SDK (Python), Firestore
- Duplicate Detection: Python-Levenshtein, custom token similarity
- Cross-System Sync: WordPress REST API
- Safety: Dry-run preview, batched writes, updateSource: 'script' trigger bypass, archive-not-delete