5 Tricks When A/B Testing Is Off The Table

Sometimes you cannot do A/B testing, but that does not mean you have to fly blind: a range of econometric methods can illuminate the causal relationships at play.

By Emily Glassberg Sands (Coursera), Duncan Gilchrist (Uber)


At least a dozen times a day we ask, “Does X drive Y?” Y is generally some KPI our companies care about. X is some product, feature, or initiative.

The first and most intuitive place we look is the raw correlation. Are users engaging with X more likely to have outcome Y? Unfortunately, raw correlations alone are rarely actionable. The complicating factor here is a set of other features that might affect both X and Y. Economists call these confounding variables.

In ed-tech, for example, we want to spend our energy building products and features that help users complete courses. Does our mobile app meet that criterion? Certainly the raw correlation is there: users who engage more with the app are on average more likely to complete. But there are also important confounders at play. For one, users with stronger innate interest in the product are more likely to be multi-device users; they are also more likely to make the investments required to complete the course. So how can we estimate the causal effect of the app itself on completion?

The knee-jerk reaction in tech is to get at causal relationships by running randomized experiments, commonly referred to as AB tests. AB testing is powerful stuff: by randomly assigning some users and not others an experience, we can precisely estimate the causal impact of the experience on the outcome (or set of outcomes) we care about. It’s no wonder that for many experiences — different copy or color, a new email campaign, an adjustment to the onboarding flow — AB testing is the gold standard.

For some key experiences, though, AB tests can be costly to implement. Consider rolling back your mobile app’s full functionality from a random subset of users. It would be confusing for your users, and a hit to your business. On sensitive product dimensions — pricing, for example — AB testing can also hurt user trust. And if tests are perceived as unethical, you might be risking a full-on PR disaster.

Here’s the good news: just because we can’t always AB test a major experience doesn’t mean we have to fly blind when it matters most. A range of econometric methods can illuminate the causal relationships at play, providing actionable insights for the path forward.

Econometric Methods

First, a quick recap of the challenge: we want to know the effect of X on Y, but there exists some set of confounding factors, C, that affects both the input of interest, X, and the outcome of interest, Y. In Stats 101, you might have called this omitted variable bias.

The solution is a toolkit of five econometric methods we can apply to get around the confounding factors and credibly estimate the causal relationship:

  1. Controlled Regression
  2. Regression Discontinuity Design
  3. Difference-in-Differences
  4. Fixed-Effects Regression
  5. Instrumental Variables (with or without a Randomized Encouragement Trial)

This post will be applied and succinct. For each method, we’ll open with a high-level overview, run through one or two applications in tech, and highlight major underlying assumptions or common pitfalls.

Some of these tools will work better in certain situations than others. Our goal is to give you the baseline knowledge you need to identify which method to use for the questions that matter to you, and to implement it effectively.

Let’s get started.

Method 1: Controlled Regression

The idea behind controlled regression is that we might control directly for the confounding variables in a regression of Y on X. The statistical requirement for this to work is that the distribution of potential outcomes, Y, should be conditionally independent of the treatment, X, given the confounders, C.

Let’s say we want to know the impact of some existing product feature — e.g., live chat support — on an outcome, product sales. The “why we care” is hopefully self-evident: if the impact of live chat is large enough to cover the costs, we want to expand live chat support to boost profits; if it’s small, we’re unlikely to expand the feature, and may even deprecate it altogether to save costs.

We can easily see a positive correlation between use of chat support and user-level sales. We also probably have some intuition around likely confounders. For example, younger users are more likely to use chat because they are more comfortable with the chat technology; they also buy more because they have more slush money.

Since youth is positively correlated with chat usage and sales, the raw correlation between chat usage and sales would overstate the causal relationship. But we can make progress by estimating a regression of sales on chat usage controlling for age. In R:

fit <- lm(Y ~ X + C, data = ...)
summary(fit)

Here Y is user-level sales, X is chat usage, and C is age.

The catch is that controlled regression only recovers a causal effect if we observe and control for every confounder, and in practice some confounders (a user’s underlying interest in the product, say) are hard to measure directly. A useful gut check is to add proxies for the suspected unobservables built from the data we do have. If we see that adding the proxy to the regression meaningfully impacts the coefficient on the primary regressor of interest, X, there’s a good chance regression won’t suffice. We need another tool.
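Before reaching for one, it’s worth actually running that check. Here’s a minimal sketch, assuming a data frame df with hypothetical columns sales, chat_usage, age, and prior_sessions (a pre-treatment proxy for the user’s underlying interest):

# Baseline controlled regression vs. the same regression with an interest proxy
fit_base  <- lm(sales ~ chat_usage + age, data = df)
fit_proxy <- lm(sales ~ chat_usage + age + prior_sessions, data = df)

# If the coefficient on chat_usage moves materially once the proxy is added,
# unobserved confounding likely remains and controlled regression won't suffice
coef(fit_base)["chat_usage"]
coef(fit_proxy)["chat_usage"]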

But before moving on, let’s briefly cover the concept of bad controls, or why-we-shouldn’t-just-throw-in-the-kitchen-sink-and-see-what-sticks (statistically). Suppose we were concerned about general interest in the product as a confounder: the more interested the user is, the more she engages with our features, including live chat, and also the more she buys from us. We might think an attribute like the proportion of our emails she opens could serve as a proxy for interest, and be tempted to control for it. But insofar as the treatment (engaging in live chat) could in itself impact this feature (e.g., because she wants to see follow-up responses from the agent), we would actually be inducing included variable bias. The take-away? Be wary of controlling for variables that are themselves not fixed at the time the “treatment” was determined.

Method 2: Regression Discontinuity Design

Regression discontinuity design, or RDD, is a statistical approach to causal inference that takes advantage of randomness in the world. In RDD, we focus in on a cut-off point that, within some narrow range, can be thought of as a local randomized experiment.

Suppose we want to estimate the effect of passing an ed-tech course on earned income. Randomly assigning some people to the “passing” treatment and failing others would be fundamentally unethical, so AB testing is out. Since several hard-to-measure things are correlated with both passing a course and income — innate ability and intrinsic motivation to name just two — we also know controlled regression won’t suffice.

Luckily, we have a passing cutoff that creates a natural experiment: users earning a grade of 70 or above pass while those with a grade just below do not. A student who earns a 69 and thus does not pass is plausibly very similar to a student earning a 70 who does pass. Provided we have enough users in some narrow band around the cutoff, we can use this discontinuity to estimate the causal effect of passing on income. In R:

library(rdd)
# Y is the outcome (income); the right-hand-side variable is the running
# variable (the course grade), with cutpoint set to the passing threshold (70)
RDestimate(Y ~ D, data = ..., subset = ..., cutpoint = ...)
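To see the mechanics end to end, here’s a toy simulation (entirely made-up numbers) in which passing a 70-point cutoff shifts income by 5 units; the estimated effect should land near that jump:

# Simulate a running variable (grade) and an outcome (income) with a
# discontinuity of 5 at the passing threshold of 70
set.seed(1)
n  <- 5000
df <- data.frame(grade = runif(n, 40, 100))
df$income <- 20 + 0.1 * df$grade + 5 * (df$grade >= 70) + rnorm(n, sd = 3)

library(rdd)
est <- RDestimate(income ~ grade, data = df, cutpoint = 70)
summary(est)  # the estimated effect should be close to the simulated jump of 5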

Here’s what an RDD might look like in graphical form:

Let’s introduce two concepts of validity:

  • Results are internally valid if they are unbiased for the subpopulation studied.
  • Results are externally valid if they are unbiased for the full population.

In randomized experiments, the assumptions underlying internal and external validity are rather straightforward. Results are internally valid provided the randomization was executed correctly and the treatment and control samples are balanced. And they are externally valid so long as the impact on the experimental group was representative of the impact on the overall population.

In non-experimental settings, the underlying assumptions depend on the method used. Internal validity of RDD inferences rests on two important assumptions:

Assumption 1: Imprecise control of assignment around the cutoff. Individuals cannot control whether they are just above (versus just below) the cutoff.

Assumption 2: No confounding discontinuities. Being just above (versus just below) the cutoff should not influence other features.

Let’s see how well these assumptions hold in our example.

Assumption 1: Users cannot control their grade around the cutoff. If users could, for example, write in and ask us for a re-grade or a grade boost, this assumption would be violated. Not sure either way?

Here are a couple of ways to validate (both sketched in R after the list):

  1. Check whether the mass just below the cutoff is similar to the mass just above the cutoff; if there is more mass on one side than the other, individuals may be exerting agency over assignment.
  2. Check whether the composition of users in the two buckets looks otherwise similar along key observable dimensions. Do exogenous observables predict the bucket a user ends up in better than randomly? If so, RDD may be invalid.
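Here’s what those two checks might look like in R, assuming a data frame df with the running variable grade plus hypothetical pre-treatment observables like age and prior_courses:

library(rdd)

# 1. McCrary-style density check: a significant discontinuity in the density
#    of grades at the cutoff suggests users are manipulating their scores
DCdensity(df$grade, cutpoint = 70)

# 2. Covariate balance in a narrow band around the cutoff: exogenous
#    observables should not predict which side of the cutoff a user lands on
band <- subset(df, abs(grade - 70) <= 2)
band$above <- as.numeric(band$grade >= 70)
summary(glm(above ~ age + prior_courses, data = band, family = binomial))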

Assumption 2: Passing is the only differentiator between a 69 and a 70. If, for example, users who get a 70 or above also get reimbursed for their tuition by their employer, this would generate an income shock for those users (and not for users with lower scores) and violate the no confounding discontinuities assumption. The effect we estimate would then be the combined causal effect of passing the class and getting tuition reimbursed, and not simply the causal effect of passing the class alone.

What about external validity? In RDD, we’re estimating what is called a local average treatment effect (LATE). The effect pertains to users in some narrow, or “local”, range around the cutoff. If there are different treatment effects for different types of users (i.e., heterogeneous treatment effects), then the estimates may not be broadly applicable to the full group. The good news is the interventions we’d consider would often occur on the margin — passing marginally more or marginally fewer learners — so a LATE is often exactly what we want to measure.

Method 3: Difference-in-Differences

The simplest version of difference-in-differences (or DD) is a comparison of pre and post outcomes between treatment and control groups.

DD is similar in spirit to RDD in that both identify off of existing variation. But unlike RDD, DD relies on the existence of two groups — one that is served the treatment (after some cutoff) and one that never is. Because DD includes a control group in the identification strategy, it is generally more robust to confounders.

Let’s consider a pricing example. We all want to know whether we should raise or lower price to increase revenue. If price elasticity (in absolute value) is greater than 1, lowering price will increase purchases by enough to increase revenue; if it’s less than 1, raising prices will increase revenue. How can we learn where we are on the revenue curve?
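To make that rule of thumb concrete, here’s a quick back-of-the-envelope calculation with made-up numbers:

# Illustrative only: assume a price elasticity of -1.5, so |elasticity| > 1
elasticity      <- -1.5
price_change    <- -0.10                        # a 10% price cut
quantity_change <- elasticity * price_change    # roughly +15% more purchases
revenue_change  <- (1 + price_change) * (1 + quantity_change) - 1
revenue_change                                  # ~0.035, i.e. revenue up ~3.5%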

The most straightforward method would be a full randomization through AB testing price. Whether we’re comfortable running a pricing AB test probably depends on the nature of our platform, our stage of development, and the sensitivity of our users. If variation in price would be salient to our users, for example because they communicate to each other on the site or in real life, then price testing is potentially risky. Serving different users different prices can be perceived as unfair, even when random. Variation in pricing can also diminish user trust, create user confusion, and in some cases even result in a negative PR storm.

A nice alternative to AB testing is to combine a quasi-experimental design with causal inference methods.

We might consider just changing price and implementing RDD around the date of the price change. This could allow us to estimate the effect of the price change on the revenue metric of interest, but it has some risks. Recall that RDD assumes there’s nothing outside of the price change that would also change purchasing behavior over the same window. If there’s an internal shock — a new marketing campaign, a new feature launch — or, worse, an external shock — e.g., a slowing economy — we’ll be left trying to disentangle the relative effects. RDD results risk being inconclusive at best and misleading at worst.

A safer bet would be to set ourselves up with a DD design. For example, we might change price in some geos (e.g., states or countries), but not others. This gives a natural “control” in the geos where price did not change. To account for any other site or external changes that might have co-occurred, we can use the control markets — where price did not change — to calculate the counterfactual we would have expected in our treatment markets absent the price change. By varying at the geo (versus user) level, we also reduce the salience of the price variation relative to AB testing.
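The DD estimate itself is just a difference of differences in cell means. Here’s a minimal sketch, assuming a data frame df with hypothetical columns revenue (per-session revenue), treat (1 = treatment geo), and post (1 = after the price change):

# Mean revenue in each of the four treatment-by-period cells
cell_means <- aggregate(revenue ~ treat + post, data = df, FUN = mean)

# DD = (treatment post - treatment pre) - (control post - control pre)
with(cell_means,
     (revenue[treat == 1 & post == 1] - revenue[treat == 1 & post == 0]) -
     (revenue[treat == 0 & post == 1] - revenue[treat == 0 & post == 0]))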

Below is a graphical representation of DD. You can see treatment and control geos in the periods pre and post the change. The delta between the actual revenue (here per session cookie) in the treatment group and the counterfactual revenue in that group provides an estimate of the treatment effect:

While the visualization is helpful (and hopefully intuitive), we can also implement a regression version of DD. At its essence, the DD regression has an indicator for treatment status (here being a treatment market), an indicator for post-change timing, and the interaction of those two indicators. The coefficient on the interaction term is the estimated effect of the price change on revenue. If there are general time trends, we can control for those in the regression, too. In R:

# post = 1 for observations after the price change, treat = 1 for treatment
# geos; the coefficient on the interaction term is the DD estimate
fit <- lm(Y ~ post + treat + I(post * treat), data = ...)
summary(fit)
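If we want the time-trend controls mentioned above, or standard errors that respect the fact that assignment varies at the geo level, one common approach uses the sandwich and lmtest packages. A sketch, assuming a data frame df with hypothetical columns revenue, treat, post, month, and geo:

# DD regression with month dummies to absorb general time trends
fit <- lm(revenue ~ post + treat + I(post * treat) + factor(month), data = df)
summary(fit)

# Cluster standard errors by geo, since treatment is assigned at the geo level
library(lmtest)
library(sandwich)
coeftest(fit, vcov = vcovCL(fit, cluster = df$geo))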