Burrfect PMF Dashboard

Live

A 13-page product-market fit (PMF) validation dashboard backed by 20+ BigQuery views. It measures whether the AI recommendation engine actually improves user outcomes, from cohort retention to activation breakpoints to A/B experiment significance.

Streamlit BigQuery Plotly GA4 Burrfect

The Problem

My cofounder built an AI recommendation engine for Burrfect — 12 models helping users dial in better espresso shots. It was serving real users in production. But we had no way to measure whether it was actually working. Were recommendations improving retention? At what point do users activate? Where does the onboarding funnel break? Which experiments moved the needle?

GA4's built-in reports couldn't answer these questions. We needed real-time, product-specific analytics that connected Firebase Auth identities to GA4 event streams and computed metrics tailored to our domain — things like shot-count activation breakpoints, recommendation adoption rates, and cohort-level retention with statistical significance on A/B tests.

My Role

Solo builder. I designed the analytics methodology, wrote every BigQuery SQL view, built the Streamlit dashboard, and deployed it on Appliku with Tailscale for remote access. My cofounder built the recommendation engine; I built the system that measures whether it works.

The Approach

Foundation: The Identity Bridge

The first challenge was connecting two data worlds. Firebase Auth has user IDs. GA4 has anonymous event streams with its own user identifiers. Without linking them, I couldn't answer the most basic question: "Do users who follow recommendations retain better than those who don't?"

I built a set of identity bridge views in BigQuery that join Firebase Auth UIDs to GA4 events, creating a unified user timeline. Every downstream analysis depends on this layer.
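The join logic behind such a bridge can be sketched in a few lines. This is an illustration in Python, not the BigQuery SQL the views are written in, and it assumes the app sets GA4's `user_id` field to the Firebase UID once a user signs in. `user_pseudo_id`, `user_id`, and `event_name` are columns of the GA4 BigQuery export; the `Ga4Event` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ga4Event:
    user_pseudo_id: str        # GA4's device-scoped identifier
    user_id: Optional[str]     # assumed to carry the Firebase UID after sign-in
    event_name: str
    event_ts: int              # event timestamp, any monotonically increasing unit


def build_identity_map(events):
    """Map each user_pseudo_id to the first Firebase UID seen alongside it."""
    mapping = {}
    for e in events:
        if e.user_id is not None:
            mapping.setdefault(e.user_pseudo_id, e.user_id)
    return mapping


def unified_timeline(events):
    """Group event names by Firebase UID, backfilling pre-login events."""
    mapping = build_identity_map(events)
    timeline = {}
    for e in sorted(events, key=lambda ev: ev.event_ts):
        uid = e.user_id or mapping.get(e.user_pseudo_id)
        if uid is None:
            continue  # this device was never linked to an account
        timeline.setdefault(uid, []).append(e.event_name)
    return timeline


events = [
    Ga4Event("dev1", None, "app_open", 1),       # before sign-in
    Ga4Event("dev1", "uidA", "login", 2),
    Ga4Event("dev1", "uidA", "shot_logged", 3),
    Ga4Event("dev2", None, "app_open", 4),       # never signs in
]
timeline = unified_timeline(events)
```

Taking the first UID per device is a simplification: a shared device with several accounts would need a time-bounded mapping instead.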

Retention as North Star

Cohort retention became the core metric, visualized as heatmaps. I built weekly and monthly cohort views that show exactly when users drop off, segmented by acquisition channel, feature adoption, and recommendation engagement. These aren't vanity metrics: they're the basis for our product decisions.
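To make the cohort computation concrete, here is a minimal sketch of a weekly retention matrix in Python. It assumes a user's cohort is the calendar week (Monday start) of their first recorded activity and that "retained in week k" means any activity in that week; both definitions, and the function names, are assumptions rather than the definitions used in the BigQuery views.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d):
    """Return the Monday of the week containing d."""
    return d - timedelta(days=d.weekday())


def weekly_retention(activity):
    """activity: iterable of (user_id, date).

    Returns {cohort_week: {week_offset: fraction_of_cohort_active}}.
    """
    first_week = {}
    active_weeks = defaultdict(set)
    for uid, d in activity:
        w = week_start(d)
        active_weeks[uid].add(w)
        if uid not in first_week or w < first_week[uid]:
            first_week[uid] = w

    cohort_sizes = defaultdict(int)
    retained = defaultdict(lambda: defaultdict(int))
    for uid, cohort in first_week.items():
        cohort_sizes[cohort] += 1
        for w in active_weeks[uid]:
            retained[cohort][(w - cohort).days // 7] += 1

    return {
        cohort: {off: n / cohort_sizes[cohort] for off, n in sorted(offsets.items())}
        for cohort, offsets in retained.items()
    }


activity = [
    ("a", date(2024, 1, 1)),   # Monday, cohort week for both users
    ("a", date(2024, 1, 9)),   # user a returns in week 1
    ("b", date(2024, 1, 3)),   # user b never returns
]
matrix = weekly_retention(activity)
```

Each row of the resulting matrix corresponds to one row of a cohort heatmap; segmentation amounts to running the same computation on a filtered subset of users.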

Activation Research

The key question: how many shots does a user need to log before they "get it" and stick around? I built breakpoint analysis that tests different shot-count thresholds against long-term retention, identifying the activation moment. This directly informed onboarding design — if users need to log N shots to activate, the onboarding flow needs to get them there fast.
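One way to frame the threshold comparison is sketched below. For each candidate N it splits users into those who logged at least N shots in an early window and those who did not, then compares long-term retention between the two groups. The input shape, the function names, and the "largest retention gap" selection rule are all assumptions for illustration; the comparison is correlational, so a large gap suggests a candidate activation point rather than proving one.

```python
def breakpoint_table(users, thresholds):
    """users: list of (early_shot_count, retained_long_term).

    Returns one row per threshold:
    (threshold, users_at_or_above, retention_at_or_above, retention_below, gap).
    """
    rows = []
    for n in thresholds:
        above = [kept for shots, kept in users if shots >= n]
        below = [kept for shots, kept in users if shots < n]
        if not above or not below:
            continue  # a threshold with an empty group cannot be compared
        rate_above = sum(above) / len(above)
        rate_below = sum(below) / len(below)
        rows.append((n, len(above), rate_above, rate_below, rate_above - rate_below))
    return rows


def best_breakpoint(rows):
    """Pick the threshold with the largest retention gap (a hypothetical rule)."""
    return max(rows, key=lambda r: r[4])[0]


users = [
    (1, False), (1, False), (2, False), (2, False),
    (3, True), (3, True), (4, True), (5, True),
]
rows = breakpoint_table(users, thresholds=[2, 3, 4])
candidate = best_breakpoint(rows)
```

In practice each group should also meet a minimum size before its rate is trusted, since high thresholds leave few users above the line.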

A/B Testing Infrastructure

Every experiment needs guardrail metrics, not just the target metric. I built A/B experiment pages that compute statistical significance (using scipy) alongside guardrail metrics — making sure a change that improves one metric doesn't quietly degrade another.
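The pairing of a significance test with guardrail checks can be sketched as follows. The dashboard uses scipy, and the specific test it runs is not stated here; this example substitutes a pooled two-proportion z-test written with the standard library only. The function names, the 0.05 threshold, and the "higher is better" convention for every metric are assumptions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


def evaluate_experiment(target, guardrails, alpha=0.05):
    """target and each guardrail are (conv_a, n_a, conv_b, n_b) tuples."""
    z, p = two_proportion_z_test(*target)
    target_win = z > 0 and p < alpha
    breached = []
    for name, counts in guardrails.items():
        gz, gp = two_proportion_z_test(*counts)
        if gz < 0 and gp < alpha:  # variant is significantly worse
            breached.append(name)
    return {
        "target_win": target_win,
        "p_value": p,
        "breached_guardrails": breached,
        "ship": target_win and not breached,
    }


result = evaluate_experiment(
    target=(100, 1000, 150, 1000),                     # variant B converts better
    guardrails={"week1_retention": (500, 1000, 400, 1000)},  # but retains worse
)
```

Here the target metric improves significantly while the guardrail degrades, so the sketch reports a win on the target and still declines to ship.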

What I Built

13 analysis pages, each backed by dedicated BigQuery views:

  1. Overview — key metrics at a glance, trend lines, health indicators
  2. Retention Cohorts — weekly/monthly cohort heatmaps with segmentation
  3. Usage Density — when and how often users engage, time-of-day and day-of-week patterns
  4. Power User Curves — distribution of engagement intensity across the user base
  5. Activation Research — shot-count breakpoint analysis against long-term retention
  6. Onboarding Funnel — step-by-step drop-off from install through first value moment
  7. Conversion Funnels — free-to-paid conversion paths and bottlenecks
  8. Lifecycle Stages — user classification (new, active, at-risk, dormant, churned)
  9. Resurrection Rates — who comes back after going dormant, and what triggers it
  10. Natural Frequency — organic usage cadence without nudges, the "true" engagement rhythm
  11. Notification Impact — push notification effectiveness on engagement and retention
  12. A/B Experiments — experiment results with statistical significance and guardrail metrics
  13. VC Benchmarks — our metrics compared against published PMF benchmarks for consumer apps
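As an illustration of the lifecycle classification on page 8, a rule-based version might look like the sketch below. The stage names come from the list above; every day threshold and the function name are hypothetical placeholders, not the values used in the dashboard.

```python
from datetime import date

# Hypothetical thresholds, in days; tune to the product's usage cadence.
NEW_WINDOW_DAYS = 7
AT_RISK_AFTER_DAYS = 14
DORMANT_AFTER_DAYS = 30
CHURNED_AFTER_DAYS = 90


def lifecycle_stage(signup, last_active, today):
    """Classify a user by days idle, then by account age."""
    idle_days = (today - last_active).days
    if idle_days >= CHURNED_AFTER_DAYS:
        return "churned"
    if idle_days >= DORMANT_AFTER_DAYS:
        return "dormant"
    if idle_days >= AT_RISK_AFTER_DAYS:
        return "at-risk"
    if (today - signup).days < NEW_WINDOW_DAYS:
        return "new"
    return "active"


today = date(2024, 6, 1)
stage = lifecycle_stage(signup=date(2024, 1, 1), last_active=date(2024, 5, 10), today=today)
```

Checking idleness before account age means a user who signed up recently but has already gone quiet is flagged by inactivity rather than counted as new.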

The Result

The dashboard identified specific shot-count breakpoints associated with higher long-term retention, giving us a concrete activation target for onboarding. A/B test results showed statistically significant impact from recommendation engine changes, replacing gut-feel product decisions with data-driven ones. The VC benchmarks page gave us an honest read on where we stand against published PMF thresholds.

Most importantly, the dashboard answered the question we built it for: is the AI recommendation engine actually improving user outcomes? We can now segment retention by recommendation adoption, measure the delta, and make informed decisions about where to invest engineering effort.

Tech Stack

  • Dashboard: Streamlit, Plotly (interactive charts)
  • Data Pipeline: BigQuery (20+ SQL views), GA4 event export, Firebase/Firestore
  • Statistics: scipy (significance testing, confidence intervals)
  • Deployment: Appliku (Docker), Tailscale (remote access)