Lester Leong
How to Build a Product Engagement Score From Scratch
Why You Need a Single Engagement Number
Every product team I have worked with reaches the same inflection point. They have event data. They have DAU counts, feature usage logs, session durations, and click streams. What they do not have is a single number that answers: "How engaged is this user?"
Without that number, engagement conversations become anecdotal. Someone pulls a chart showing session duration is up. Someone else points out that a core feature's usage is declining. Both are right, and neither can resolve the disagreement because there is no shared definition of what "engaged" means for their product.
A product engagement score solves this by collapsing multiple behavioral signals into one composite metric per user, per time period. It gives you a common language for segmentation ("high engagement" vs. "at risk"), a leading indicator for retention, and a dependent variable you can optimize against.
I have built engagement scoring systems across roughly 20 consulting engagements through Gradient Growth, spanning SaaS, fintech, and marketplace products. Before consulting, I built the engagement scoring system at a financial social media startup prior to its acquisition. That score became central to how we identified power users, predicted churn, and ultimately told the retention story that mattered during due diligence. Now, working on a GenAI squad at a major finance technology company, I see engagement scoring applied at a scale where the methodology must be rigorous because the downstream decisions affect millions of users.
Here is the framework I use every time.
Step 1: Choose Your Input Events
The first decision is which user actions feed into the score. This is where most teams go wrong. They include too many events, diluting the signal, or too few, missing critical behavior.
The selection criteria I use:
The event must be intentional. Page views and session starts are passive. They happen when a user opens your app, not when a user chooses to do something. Passive events inflate engagement scores for users who open the app, stare at the home screen, and leave. Filter them out. Use events that require a deliberate action: creating content, completing a workflow, inviting a collaborator, configuring a setting.
The event must correlate with retention. Pull your event data for a 90-day window. For each event type, compute the correlation between "user performed this event at least once in week 1" and "user retained at week 8." Rank the events by this correlation. The top 5 to 8 events are your candidates. At a B2B SaaS client, this analysis revealed that "exported a report" had a 0.61 correlation with 8-week retention, while "viewed dashboard" (which the team assumed was their most important engagement signal) correlated at 0.14. The dashboard view was table stakes. The export was the value delivery moment.
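The ranking described above can be sketched in a few lines of pandas. This is a minimal illustration, not a definitive implementation: the function name `rank_events_by_retention`, the input shapes, and the toy data are all assumptions made for the example. It relies on the fact that the Pearson correlation of two binary flags is the phi coefficient.

```python
import pandas as pd


def rank_events_by_retention(
    week1_events: pd.DataFrame,
    retained: pd.Series,
    user_col: str = "user_id",
    event_col: str = "event_name",
) -> pd.Series:
    """Correlate 'did the event at least once in week 1' with week-8 retention.

    week1_events : one row per event occurrence in each user's first week.
    retained     : 0/1 (or boolean) Series indexed by user id (hypothetical input).
    """
    # user x event matrix of 0/1 flags: did the user do the event at least once?
    flags = (
        pd.crosstab(week1_events[user_col], week1_events[event_col])
        .gt(0)
        .astype(int)
    )
    # users with no week-1 events still belong in the cohort, with every flag at 0
    flags = flags.reindex(retained.index, fill_value=0)
    # Pearson correlation between two binary variables (the phi coefficient)
    return flags.corrwith(retained.astype(int)).sort_values(ascending=False)


# toy data, invented purely to exercise the function
week1 = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 4, 5],
    "event_name": ["exported_report"] * 3 + ["viewed_dashboard"] * 4,
})
retained_wk8 = pd.Series([1, 1, 1, 0, 0, 0], index=[1, 2, 3, 4, 5, 6])

ranking = rank_events_by_retention(week1, retained_wk8)
print(ranking)
```

Events that every user (or no user) performed have zero variance, so their correlation comes back as NaN and sorts to the bottom of the ranking.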
The event must have meaningful variance. If 95% of active users perform an event, it does not differentiate engagement levels. If 2% perform it, it is too rare to contribute signal. Target events where 15 to 70% of your active user base engages within your measurement window.
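A rough sketch of the variance check, assuming "active user" means anyone with at least one event in the measurement window (your definition may differ). The function names and the toy event names are hypothetical; the 15% and 70% bounds come from the guideline above.

```python
import pandas as pd


def event_penetration(
    events: pd.DataFrame,
    user_col: str = "user_id",
    event_col: str = "event_name",
) -> pd.Series:
    """Share of active users who performed each event at least once."""
    # assumption: a user is "active" if they have any event in the window
    active_users = events[user_col].nunique()
    users_per_event = events.groupby(event_col)[user_col].nunique()
    return users_per_event / active_users


def filter_by_penetration(
    penetration: pd.Series, low: float = 0.15, high: float = 0.70
) -> list[str]:
    """Keep events whose penetration falls inside [low, high]."""
    return penetration[penetration.between(low, high)].index.tolist()


# toy data: 10 users, three events with very different reach
rows = (
    [(u, "session_start") for u in range(10)]      # 100% of users
    + [(u, "report_exported") for u in range(4)]   # 40% of users
    + [(0, "api_key_created")]                     # 10% of users
)
toy_events = pd.DataFrame(rows, columns=["user_id", "event_name"])

penetration = event_penetration(toy_events)
candidates = filter_by_penetration(penetration)
print(candidates)
```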
For most products, I end up with 5 to 7 input events. At the financial social media startup, our score used six: portfolio update, social post created, comment written, user followed, insight shared, and watchlist modified. Together they covered four distinct engagement modes: portfolio management, content creation, social interaction, and discovery.
Step 2: Assign Weights
Not all actions signal equal engagement depth. Following someone is low effort. Writing a detailed post is high effort. The score should reflect that asymmetry.
I use two approaches depending on data maturity:
Data-driven weights (preferred). Run a logistic regression where the target variable is "retained at week 8" (or whatever retention window matters for your product) and the features are per-user event counts for each input event during the scoring window. The coefficients become your weights. This approach lets the data tell you which actions actually predict retention, rather than relying on intuition.
At one marketplace client, the team assumed that "completed a purchase" should carry the highest weight. The regression revealed that "saved an item to a list" was a stronger retention predictor. Saving signaled intent and ongoing interest. Purchasing sometimes signaled a one-time need that, once fulfilled, eliminated the reason to return. That finding reshaped their entire engagement strategy.
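Here is one way the data-driven approach might look with scikit-learn. Treat it as a sketch under stated assumptions: `fit_event_weights` is a hypothetical helper, the features are log-dampened to match the dampening used in the Step 3 scoring function, and clipping negative coefficients to zero is a simplification I am assuming for the example (a negative coefficient usually deserves investigation, not silent clipping). The data is synthetic and constructed so that saving drives retention; it is not a client result.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def fit_event_weights(
    event_counts: pd.DataFrame, retained: pd.Series
) -> dict[str, float]:
    """Fit retention ~ log1p(event counts) and return coefficients as weights.

    event_counts : one row per user, one column per input event (raw counts).
    retained     : 0/1 (or boolean) Series on the same index.
    """
    # dampen counts the same way the scoring function does
    X = np.log1p(event_counts)
    y = retained.loc[event_counts.index].astype(int)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    coefs = pd.Series(model.coef_[0], index=event_counts.columns)
    # assumption for this sketch: negative coefficients become zero weight
    return coefs.clip(lower=0).round(3).to_dict()


# synthetic data: retention depends on saves only, purchases are noise
rng = np.random.default_rng(0)
n = 2000
toy_counts = pd.DataFrame({
    "saved_item": rng.poisson(2, n),
    "completed_purchase": rng.poisson(2, n),
})
logit = 1.5 * np.log1p(toy_counts["saved_item"]) - 1.5
p_retain = 1 / (1 + np.exp(-logit))
toy_retained = pd.Series(rng.random(n) < p_retain, index=toy_counts.index)

weights = fit_event_weights(toy_counts, toy_retained)
print(weights)
```

Because the coefficients depend on feature scale, fit them on the same transformation the score applies. Fitting on raw counts and then scoring on dampened counts would mix two different scales.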
Expert-informed weights (when data is sparse). For early-stage products without enough retention data, assign weights on a 1 to 5 scale based on effort and value signal. Core value actions (the reason the user signed up) get a 5. Supporting actions (social, configuration) get a 2 or 3. Discovery actions (browsing, searching) get a 1. Revisit these weights quarterly as you accumulate data, and transition to the data-driven approach as soon as your retention cohorts are large enough.
Step 3: Compute and Normalize
Here is the Python implementation I use as a starting point. It takes raw event data, applies weights, computes a raw score per user, and normalizes to a 0 to 100 scale.
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def compute_engagement_score(
    events: pd.DataFrame,
    weights: dict[str, float],
    window_days: int = 28,
    user_col: str = "user_id",
    event_col: str = "event_name",
    date_col: str = "event_date",
) -> pd.DataFrame:
    """
    Build a weighted engagement score per user from raw event data.

    Parameters
    ----------
    events : DataFrame with columns [user_id, event_name, event_date].
    weights : mapping of event_name -> numeric weight. Only events
        present in this dict are included.
    window_days : rolling window for score computation.
    user_col : name of the user identifier column.
    event_col : name of the event name column.
    date_col : name of the date column (coerced to datetime).

    Returns
    -------
    DataFrame with columns [user_id, raw_score, engagement_score]
    where engagement_score is normalized to 0-100.
    """
    events = events.copy()
    events[date_col] = pd.to_datetime(events[date_col])

    # filter to scoring window
    cutoff = events[date_col].max() - pd.Timedelta(days=window_days)
    recent = events[events[date_col] >= cutoff]

    # keep only weighted events
    scored_events = recent[recent[event_col].isin(weights)]

    # count events per user per event type
    counts = (
        scored_events
        .groupby([user_col, event_col])
        .size()
        .reset_index(name="count")
    )

    # apply log dampening to prevent power users from skewing the scale
    counts["dampened"] = np.log1p(counts["count"])

    # apply weights
    counts["weighted"] = (
        counts["dampened"] * counts[event_col].map(weights)
    )

    # aggregate per user
    user_scores = (
        counts
        .groupby(user_col)["weighted"]
        .sum()
        .reset_index(name="raw_score")
    )

    # normalize to 0-100
    scaler = MinMaxScaler(feature_range=(0, 100))
    user_scores["engagement_score"] = (
        scaler.fit_transform(user_scores[["raw_score"]])
        .round(1)
        .flatten()
    )

    return user_scores.sort_values("engagement_score", ascending=False)
```
A few implementation notes:
Log dampening is not optional. Without it, a user who triggers 500 events in a category dominates the scale, compressing everyone else toward zero. `log1p` preserves the rank order while reducing the effect of extreme outliers. At the startup, our initial score without dampening had a distribution where the top 2% of users occupied 60% of the score range. After log dampening, the distribution spread across the full 0 to 100 range, which is what you need for meaningful segmentation.
The 28-day window aligns with monthly business cycles. For products with shorter natural cycles (daily tools, communication apps), a 7- or 14-day window may be more responsive. For products with longer cycles (quarterly reporting tools, tax software), extend to 60 or 90 days. Match the window to how frequently your users naturally interact.
MinMaxScaler is sensitive to outliers even after dampening. For production systems, consider using percentile-based normalization (assign the score based on the user's percentile rank) instead of min-max. Percentile normalization is more stable across time periods because it is not affected by a single extreme user.
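A minimal sketch of percentile-based normalization, assuming raw scores arrive as a pandas Series indexed by user. The helper name `percentile_normalize` is hypothetical. Ties share their average rank, and the lowest user lands at 100/n, not 0, which is a property of percentile ranks and not a bug.

```python
import pandas as pd


def percentile_normalize(raw_scores: pd.Series) -> pd.Series:
    """Map raw scores to a 0-100 scale by percentile rank."""
    # rank(pct=True) returns each value's rank divided by the number of values
    return (raw_scores.rank(pct=True) * 100).round(1)


# toy example: one extreme user among three ordinary ones
raw = pd.Series([1.0, 2.0, 3.0, 500.0], index=["a", "b", "c", "d"])
normalized = percentile_normalize(raw)
print(normalized.tolist())  # [25.0, 50.0, 75.0, 100.0]
```

With min-max scaling, the same toy input would put users a, b, and c at roughly 0, 0.2, and 0.4 because user d stretches the range. Percentile ranks keep them evenly spread.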
Step 4: Validate Against Retention
A score that does not predict retention is a vanity metric with extra math. Validation is the step that separates a useful engagement score from a meaningless index.
The validation process:
1. Compute the engagement score for a historical cohort (users who signed up 90+ days ago).
2. Segment users into quartiles by engagement score during their first 28 days.
3. Measure retention at day 60 and day 90 for each quartile.
4. The top quartile should retain at 2x or higher the rate of the bottom quartile.
If the separation is weaker than 2x, the score is not capturing the right signals. Go back to step 1 and re-examine your input events, or go back to step 2 and switch to data-driven weights.
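The quartile check can be sketched as follows. The function names, the `retained` input (a 0/1 flag per user at day 60 or day 90), and the toy numbers are assumptions for illustration. Ranking before `pd.qcut` is one way to avoid duplicate bin edges when many users share the same score.

```python
import pandas as pd


def retention_by_quartile(scores: pd.Series, retained: pd.Series) -> pd.Series:
    """Retention rate per engagement-score quartile (Q1 = lowest scores)."""
    # rank first so tied scores cannot produce duplicate quartile edges
    quartile = pd.qcut(
        scores.rank(method="first"), q=4, labels=["Q1", "Q2", "Q3", "Q4"]
    )
    return retained.astype(float).groupby(quartile, observed=True).mean()


def separation_ratio(rates: pd.Series) -> float:
    """Top-quartile retention divided by bottom-quartile retention."""
    # note: yields inf if the bottom quartile has zero retained users
    return rates["Q4"] / rates["Q1"]


# toy cohort of 16 users: scores 1..16, retention improving by quartile
toy_scores = pd.Series(range(1, 17), dtype=float)
toy_retained = pd.Series(
    [1, 0, 0, 0,   # Q1: 25% retained
     1, 0, 1, 0,   # Q2: 50%
     1, 1, 0, 1,   # Q3: 75%
     1, 1, 1, 1]   # Q4: 100%
)

rates = retention_by_quartile(toy_scores, toy_retained)
separation = separation_ratio(rates)
print(rates)
print(separation)  # 4.0
```

To keep the test honest, the scores passed in should be computed only from each user's first 28 days, with retention measured afterward.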
At the financial social media startup, our final engagement score separated retention cleanly: the top quartile retained at 74% at day 60, the bottom quartile at 18%. That 4x separation gave us confidence that the score was measuring something real. During consulting engagements, I typically see 2x to 3x separation on the first iteration, improving to 3x to 5x after one round of weight tuning.
One consulting client (a SaaS tool for small teams) initially built a score using five events. The retention separation was only 1.4x. We dug into the data and found that their highest-weighted event ("created a project") happened once per user in the first week regardless of whether they retained. It was an onboarding step, not an engagement signal. After removing it and replacing it with "added a second team member" (which had strong retention correlation), the separation jumped to 2.8x.
Step 5: Operationalize
A score that lives in a notebook is a research project. A score that drives decisions is a product tool. Here is how to move from one to the other.
Segment users into tiers. I use four tiers: Power (75 to 100), Active (50 to just under 75), Casual (25 to just under 50), and At Risk (below 25). Defining each tier by its lower bound avoids gaps for fractional scores such as 74.5. These labels give product managers, customer success teams, and marketers a shared vocabulary. When someone says "our At Risk segment grew by 8% this month," the entire team understands the urgency without needing to interpret a raw number.
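A small sketch of tier assignment with `pd.cut`, assuming scores are on the 0 to 100 scale and that each boundary belongs to the higher tier (so 75.0 is Power and 74.9 is Active). The names `TIER_EDGES`, `TIER_LABELS`, and `assign_tier` are hypothetical. The top edge is set to infinity because, with left-closed bins, a score of exactly 100 would otherwise fall outside the last bin.

```python
import pandas as pd

# left-closed bins: [0, 25), [25, 50), [50, 75), [75, inf)
TIER_EDGES = [0, 25, 50, 75, float("inf")]
TIER_LABELS = ["At Risk", "Casual", "Active", "Power"]


def assign_tier(scores: pd.Series) -> pd.Series:
    """Label each 0-100 engagement score with its tier."""
    return pd.cut(scores, bins=TIER_EDGES, labels=TIER_LABELS, right=False)


toy_scores = pd.Series([0.0, 24.9, 25.0, 49.9, 50.0, 74.5, 75.0, 100.0])
tiers = assign_tier(toy_scores)
print(tiers.astype(str).tolist())
```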
Track tier migration weekly. The most actionable view is not the score distribution itself but the movement between tiers. How many users moved from Active to At Risk this week? How many moved from Casual to Active? Tier migration is your early warning system. A spike in downward migration from Active to Casual is visible weeks before it shows up in [DAU/MAU stickiness](/insights/dau-mau-ratio-stickiness) or churn metrics.
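One possible shape for the weekly migration view is a from/to count matrix. This sketch assumes you already have a tier label per user for two consecutive weeks; the function names and toy users are hypothetical, and users who appear in only one of the two weeks are dropped here for simplicity.

```python
import pandas as pd

TIER_ORDER = ["At Risk", "Casual", "Active", "Power"]


def _pair_weeks(last_week: pd.Series, this_week: pd.Series) -> pd.DataFrame:
    """Align two weekly tier snapshots on users present in both."""
    return pd.concat({"from": last_week, "to": this_week}, axis=1, join="inner")


def tier_migration(last_week: pd.Series, this_week: pd.Series) -> pd.DataFrame:
    """Count users by (last week's tier, this week's tier)."""
    both = _pair_weeks(last_week, this_week)
    return pd.crosstab(both["from"], both["to"])


def count_downgrades(last_week: pd.Series, this_week: pd.Series) -> int:
    """Number of users whose tier is lower this week than last week."""
    both = _pair_weeks(last_week, this_week)
    rank = {tier: i for i, tier in enumerate(TIER_ORDER)}
    return int((both["to"].map(rank) < both["from"].map(rank)).sum())


# toy snapshots for five users
last = pd.Series(
    ["Active", "Active", "Casual", "Power", "Active"],
    index=["a", "b", "c", "d", "e"],
)
this = pd.Series(
    ["Active", "At Risk", "Active", "Power", "Casual"],
    index=["a", "b", "c", "d", "e"],
)

migration = tier_migration(last, this)
downgrades = count_downgrades(last, this)
print(migration)
print(downgrades)  # 2
```

Reading across a row shows where last week's tier members ended up. Tracking the downgrade count over time gives a single series to alert on.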
Connect to interventions. Each tier should trigger different product and marketing responses. At Risk users get re-engagement campaigns. Casual users get feature education (they may not know about high-value features). Active users get nudges toward power behavior. Power users get referral prompts and beta access. Without this operational mapping, the score is just a dashboard decoration.
Refresh the model quarterly. User behavior changes as your product evolves. New features shift which events matter. Seasonal patterns affect baselines. Re-run the weight calibration and retention validation every quarter to keep the score aligned with current behavior. At one client, a major feature launch in Q3 made one of the original scoring events obsolete (it was replaced by a better workflow). The score degraded silently until the quarterly review caught it.
When a Composite Score Is the Wrong Choice
I want to be direct about when this approach is not appropriate. If your product has a single, clear [North Star metric](/insights/north-star-metric-framework) that already predicts retention well, adding a composite engagement score introduces complexity without proportional value. A North Star tells you "optimize this one thing." An engagement score tells you "here is the overall picture." You need the overall picture when no single metric captures engagement adequately, which is the case for most products with multiple engagement modes.
Also, engagement scores are backward-looking by construction. They summarize what users did over the past N days, so while they lead retention metrics, they lag the individual actions they are built from. For real-time decisions (like triggering an in-app message the moment a user shows signs of disengagement), you need event-level signals, not composite scores. The score is best suited for weekly and monthly strategic analysis, cohort segmentation, and retention modeling.
The Engagement Score as a Retention Predictor
The real value of a product engagement score is not the number itself. It is the predictive power. A well-calibrated score lets you see retention problems forming weeks before they appear in the retention curve. It lets you quantify the impact of a feature launch on user engagement within days, not months. And it gives you a single axis for segmenting your user base that is grounded in behavioral reality rather than arbitrary demographics.
The teams I have seen get the most from this framework are the ones who treat the score as a living system: continuously validated, quarterly recalibrated, and operationally connected to the interventions that keep users engaged.
---
I help teams build engagement measurement systems that predict retention before it shows up in the numbers. [lester@gradientgrowth.com](mailto:lester@gradientgrowth.com)