Experimentation Platform

Why Tenant-Randomized A/B Test is Challenging and Tenant-Pairing May Not Work

When we run A/B tests, we choose the randomization unit[1]. For example, if we want to evaluate whether a product feature increases user engagement, we run an A/B test and randomize by user. Typically, A/B tests on websites, mobile apps, or consumer-facing products use the individual user, or a proxy such as an anonymized id based on a browser cookie, as the randomization unit. For applications used by enterprises, however, it can be impractical to expose users within an organization to different variants. Say we have a collaboration feature that allows users to chat in a window. If someone in a group can initiate a chat while others cannot, it will likely cause confusion and dissatisfaction with the product. Enterprise customers typically expect a consistent experience for all users within the organization. In such cases, randomizing by user no longer works. So how do we bring the benefits of A/B testing to an environment with such constraints? We need an alternative randomization unit, which brings us to tenant-randomized A/B tests.

In this post, we will talk about:

  • What is a tenant-randomized A/B test?
  • Why do tenant-randomized A/B tests pose challenges for variance computation, metric power, and experiment balance?
  • How can the Delta method, CUPED variance reduction, and SeedFinder help address these challenges?
  • What should we consider before starting a tenant-randomized A/B test?
  • Why might tenant-pairing not help in a tenant-randomized A/B test?
  • What have we learned from our studies at Microsoft?

What is a tenant-randomized A/B test?

A tenant-randomized A/B test, as the name suggests, is an A/B test that randomly splits software product instances used by tenants into Treatment and Control. A “tenant” is a group of users that require a consistent experience with the product, such as users belonging to the same organization (e.g., a company, government agency, or school), or users on the same server or in the same data center. Sometimes, tenant-randomized A/B tests are also called “enterprise-level controlled experiments”[2] or “cluster-based randomized experiments”[3]. All users in the same tenant are assigned to the same variant, while users from different tenants may see different variants. Although this may seem like a simple change of randomization unit, tenant-randomized A/B tests come with their own challenges in analysis and execution.

Figure: What is tenant randomization?

What are the challenges of tenant-randomized A/B tests?

Variance Computation. Within a tenant, user activities with a product are not independent and identically distributed (i.i.d.)[4]. Users in a financial organization are likely to have similar behaviors, and users within a tech company tend to have similar patterns of product usage. For user-level metrics (e.g., the average number of a certain activity per user) in tenant-randomized A/B tests, the i.i.d. assumption no longer holds, and the naïve variance calculation will likely underestimate the variance, leading to false detection of changes that are actually within normal variation[6]. We need an alternative method for variance computation, which we will discuss in the solution section below.
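
To see why this matters, here is a minimal simulation sketch (illustrative only, not ExP code): users within hypothetical tenants share a tenant-level effect, and the naive i.i.d. standard error of the user-level mean is compared with a cluster-aware estimate obtained by resampling whole tenants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tenants: users within a tenant share a tenant-level effect,
# so user-level observations are correlated within a tenant rather than i.i.d.
n_tenants = 2000
tenant_sizes = rng.integers(2, 100, size=n_tenants)
tenant_effects = rng.normal(0.0, 1.0, size=n_tenants)
tenant_values = [mu + rng.normal(0.0, 1.0, size=k)
                 for mu, k in zip(tenant_effects, tenant_sizes)]

all_users = np.concatenate(tenant_values)
naive_se = all_users.std(ddof=1) / np.sqrt(len(all_users))  # assumes i.i.d. users

# Approximate the true sampling variability of the user-level mean under
# tenant-level randomization by resampling whole tenants (cluster bootstrap).
tenant_sums = np.array([v.sum() for v in tenant_values])
boot_means = []
for _ in range(5000):
    idx = rng.integers(0, n_tenants, size=n_tenants)
    boot_means.append(tenant_sums[idx].sum() / tenant_sizes[idx].sum())
cluster_se = np.std(boot_means, ddof=1)

print(f"naive SE:         {naive_se:.4f}")
print(f"cluster-aware SE: {cluster_se:.4f}")  # noticeably larger
```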

Metric Power. Randomizing by tenant drastically reduces the sample size. Suppose you have 500k users of your product and on average each tenant has about 100 users. You will end up with only 5k tenants in your experiment. This reduces your metric power. In our recent research on Microsoft products, we observed significant power loss on metrics in tenant-randomized A/B tests: they were only able to detect treatment effects roughly 10x larger than those detectable in comparable user-randomized A/B tests.
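
For intuition on the size of this effect, here is a back-of-the-envelope sketch using the standard two-sample power formula and the assumed numbers above. It deliberately keeps the per-unit standard deviation fixed, which is optimistic because tenant-level units typically also have larger variance.

```python
from scipy.stats import norm

def min_detectable_effect(n_per_group, sd, alpha=0.05, power=0.8):
    """Minimum detectable difference of means for a two-sample test,
    assuming equal variance and a 50/50 split."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sd * (2.0 / n_per_group) ** 0.5

sd = 1.0  # assumed per-unit standard deviation, kept equal for both designs
mde_user = min_detectable_effect(n_per_group=250_000, sd=sd)  # 500k users, split 50/50
mde_tenant = min_detectable_effect(n_per_group=2_500, sd=sd)  # 5k tenants, split 50/50

print(f"user-randomized MDE:   {mde_user:.4f}")
print(f"tenant-randomized MDE: {mde_tenant:.4f}")
print(f"ratio: {mde_tenant / mde_user:.0f}x")  # ~10x from the sample size alone
```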

Experiment Balance. As the sample size shrinks, the experiment becomes more susceptible to the effect of outliers[5]. A tenant is an outlier if its users interact with the product very differently from other tenants’ users, or if its number of users is drastically different from other tenants’. In many cases, we observed that users are distributed disproportionately across small and large tenants, with a long-tailed distribution of tenant size. If such tenants are not split evenly between Treatment and Control, they can lead to a biased estimate of the treatment effect.

What are possible solutions?

At ExP, we have developed solutions to address metric power, variance estimation and balance issues. These solutions help tackle the challenges in tenant-randomized A/B tests and can be leveraged in other experimentation scenarios as well.

Delta Method

The Delta Method is a simple technique for correcting variance estimation when the randomization and analysis units don’t match. It takes advantage of the fact that a user-level metric (for example, “average number of sessions per user”) can be calculated either directly from user-level data or from tenant-level sums. Imagine a case with two tenants and two users in each, where \(Y_{t,u}\) is the number of sessions for user \(u\) in tenant \(t\).

Adding up over users:

\[ \bar{Y} = \frac{Y_{1,1} + Y_{1,2} + Y_{2,1} + Y_{2,2}}{4} \]

is the same as adding up over tenants:

\[ \bar{Y} = \frac{(Y_{1,1} + Y_{1,2}) + (Y_{2,1} + Y_{2,2})}{2 + 2} = \frac{S_1 + S_2}{N_1 + N_2}, \]

where \(S_t\) is the sum of sessions in tenant \(t\) and \(N_t\) is the number of users in tenant \(t\). Because the second version relies only on (i.i.d.) tenant-aggregated data, it is possible to accurately estimate the variance of the metric using a Taylor expansion of this ratio[6].
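
Below is a minimal sketch of the resulting variance estimator (the function name and inputs are illustrative, not ExP's implementation): tenant_sums and tenant_counts hold one aggregate per tenant, and the function returns the delta-method variance of the ratio metric built from them.

```python
import numpy as np

def delta_method_variance(tenant_sums, tenant_counts):
    """Variance of a user-level mean (sum over users / number of users), computed
    from i.i.d. tenant-level aggregates via the delta method [6]."""
    s = np.asarray(tenant_sums, dtype=float)
    n = np.asarray(tenant_counts, dtype=float)
    k = len(s)
    mu_n = n.mean()
    r = s.mean() / mu_n                       # the ratio metric itself
    var_s, var_n = s.var(ddof=1), n.var(ddof=1)
    cov_sn = np.cov(s, n)[0, 1]
    return (var_s - 2 * r * cov_sn + r**2 * var_n) / (k * mu_n**2)

# Hypothetical usage: a z-test of "sessions per user" for Treatment vs Control,
# where sums_t/counts_t and sums_c/counts_c hold one (sum, user count) per tenant:
#   mean_t = sums_t.sum() / counts_t.sum()
#   mean_c = sums_c.sum() / counts_c.sum()
#   z = (mean_t - mean_c) / np.sqrt(delta_method_variance(sums_t, counts_t)
#                                   + delta_method_variance(sums_c, counts_c))
```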

CUPED Variance Reduction

CUPED stands for Controlled-experiment Using Pre-Experiment Data. It is a way to define variance-reduced (a.k.a. VR) metrics and has been found effective and broadly applicable in the A/B tests we run at Microsoft[1]. The central idea of variance reduction is to replace a metric Y with an alternate metric Y’ such that E(Y’)=E(Y) but Var(Y’) < Var(Y). Using Y’ instead of Y improves metric power.

CUPED VR borrows a variance reduction technique from Monte Carlo simulations and uses pre-experiment data to remove explainable variance through a linear model[7]. The variance-reduced metric is also referred to as a “regression-adjusted metric”. It accounts for the relationship between pre-experiment and in-experiment values and detects treatment effects that alter this relationship. A VR metric can detect a difference the original metric misses, and it can also prevent a false positive when an apparent difference is explained by the pre-experiment period.
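
Here is a minimal sketch of the adjustment under these assumptions: the x arrays hold the same metric computed on pre-experiment data, the y arrays hold the in-experiment values (one entry per randomization unit), and theta is estimated once on the pooled data. This illustrates the idea in [7] with made-up numbers; it is not ExP's production implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def cuped_theta(x, y):
    """theta = Cov(x, y) / Var(x), estimated on pooled Treatment + Control data."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def cuped_adjust(y, x, theta, x_mean):
    """CUPED-adjusted metric y' = y - theta * (x - mean(x)): same expectation as y,
    lower variance whenever the pre-experiment value x predicts y."""
    return y - theta * (x - x_mean)

# Toy data: pre-experiment values x and correlated in-experiment values y,
# one entry per randomization unit in each variant (hypothetical numbers).
x_treatment = rng.normal(10, 3, size=1000)
y_treatment = 0.8 * x_treatment + rng.normal(0.5, 1, size=1000)  # small true effect
x_control = rng.normal(10, 3, size=1000)
y_control = 0.8 * x_control + rng.normal(0.0, 1, size=1000)

x_all = np.concatenate([x_treatment, x_control])
y_all = np.concatenate([y_treatment, y_control])
theta, x_mean = cuped_theta(x_all, y_all), x_all.mean()

y_treatment_vr = cuped_adjust(y_treatment, x_treatment, theta, x_mean)
y_control_vr = cuped_adjust(y_control, x_control, theta, x_mean)
print(y_treatment.var(), y_treatment_vr.var())  # variance drops after adjustment
```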

As the name suggests, CUPED VR requires collecting and retaining data from before treatment exposure. It is effective when there is significant overlap in randomization units between the pre-experiment and in-experiment periods. Typically, we suggest a pre-experiment period of 1-2 weeks for variance reduction. Because the pre- and in-experiment matching depends on the stability of the randomization unit over time, too short a period will lead to poor matching, whereas too long a period will reduce the correlation with the outcome metric during the experiment period.

When the analysis unit is different from the randomization unit, a formulation of CUPED VR in conjunction with the Delta method is necessary.

SeedFinder

SeedFinder addresses the problem of possible imbalance between Treatment and Control. It tries hundreds of seeds for randomization, evaluates metrics based on pre-experiment data, and finds the seed that results in minimum bias. The optimal seed will be used for randomization during the experiment. Choosing appropriate metrics for SeedFinder, such as metrics indicating tenant characteristics, can help establish a balanced split between treatment and control and help us draw reliable conclusions from experiment results[1].
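
For concreteness, here is a minimal sketch of the idea (the hash-based assignment, the single-metric imbalance measure, and the toy data are illustrative assumptions, not ExP's implementation):

```python
import hashlib
import numpy as np

def assign(tenant_id, seed):
    """Deterministic 50/50 split by hashing the tenant id together with a seed."""
    digest = hashlib.md5(f"{seed}:{tenant_id}".encode()).hexdigest()
    return "T" if int(digest, 16) % 2 == 0 else "C"

def imbalance(tenant_ids, pre_metric, seed):
    """Absolute standardized difference of a pre-experiment metric between the two
    groups produced by this seed; smaller means a more balanced split."""
    groups = np.array([assign(t, seed) for t in tenant_ids])
    t, c = pre_metric[groups == "T"], pre_metric[groups == "C"]
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    return abs(t.mean() - c.mean()) / se

# Toy example: 1,000 hypothetical tenants with one pre-experiment metric value each.
rng = np.random.default_rng(1)
tenant_ids = np.arange(1000)
pre_metric = rng.lognormal(mean=1.0, sigma=1.0, size=1000)

# Try a few hundred candidate seeds and keep the most balanced one; in practice
# this extends to multiple metrics describing tenant characteristics.
best_seed = min(range(500), key=lambda s: imbalance(tenant_ids, pre_metric, s))
print("best seed:", best_seed)
```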

What to consider before starting a tenant-randomized A/B test?

Sometimes, with the solutions outlined above, it is still challenging to run tenant-randomized A/B tests. This is partly due to limited metric power even after applying these solutions, and also due to inherent problems with the infrastructure and data logging. To determine whether it’s feasible to run a tenant-randomized A/B test, we recommend going through the checklist below.

Reconsider whether a tenant-randomized A/B test is necessary

  • What is the feature to be tested? Does it require a consistent experience for all users within a tenant? (If not, it might be better to run a user-randomized A/B test instead.)

Understand Telemetry and Flight Assignment

  • Which field identifies (pseudonymously) a tenant?
  • Are there empty tenant identifiers? If yes, what is the reason for that? Should they be excluded in an A/B test?
  • How many distinct tenants and users are there?
  • Are there tenants with only one user? What is the proportion of tenants with only one user or a small number of users? Should they be considered for an A/B test?
  • Are there users associated with multiple tenants? If so, what is the reason for that? Should product sessions associated with such users be excluded from analysis?
  • Should we run an A/A test to confirm balance and to verify that each tenant is assigned to only one flight?

Metric Design

  • Which metrics should be built to evaluate the feature?
  • Which of these are tenant-level metrics, and which are user-level? (User-level/single-average metrics primarily reflect the impact on large tenants, while tenant-level/double-average metrics primarily reflect the impact on small tenants; see the sketch after this list.)
  • Does the experimentation platform support the Delta method for metric variance estimation? If not, only tenant-level metrics can be used.
  • What is the minimum detectable treatment effect for each metric?
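
To make the single-average vs. double-average distinction concrete, here is a small illustrative sketch with made-up numbers (the column names and values are assumptions):

```python
import pandas as pd

# Hypothetical per-user data: one row per user, with its tenant and a session count.
df = pd.DataFrame({
    "tenant":   ["A", "A", "A", "B", "C", "C"],
    "sessions": [10,  12,  8,   3,   5,   7],
})

# User-level "single average": every user weighs equally, so large tenants dominate.
single_avg = df["sessions"].mean()                           # 45 / 6 = 7.5

# Tenant-level "double average": average within each tenant first, then across
# tenants, so every tenant weighs equally and small tenants matter more.
double_avg = df.groupby("tenant")["sessions"].mean().mean()  # (10 + 3 + 6) / 3 ≈ 6.33

print(single_avg, double_avg)
```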

Improve Metric Sensitivity

  • Is there a significant overlap of tenants between the pre-experiment and in-experiment periods? If yes, CUPED VR can be applied.
  • How much improvement does variance reduction provide? Is it big enough that the expected feature effect can be detected?

Validity Check

  • Are there any “outlier” tenants? Could they cause metrics to move in an A/A test if they are not split evenly between Treatment and Control?
  • Does the experimentation platform support the SeedFinder method to find a good seed for randomization?

What about Tenant-Pairing?

Method. Ideally, we want to compare tenants with similar attributes, so that the mixture of tenants in Control is similar to the mixture of tenants in Treatment. One option is to use pairing to improve balance. In tenant-pairing, we group all tenants into pairs, where the two tenants in a pair are “similar” to each other. The “similarity” can be defined by a selected set of attributes (e.g., geography, industry), or based on machine learning results where encoding, clustering, and approximate nearest neighbor methods are applied. When making the variant assignment, we randomly select one tenant from each pair for Treatment and assign the other to Control.

Analysis. Such paired variant assignment can be treated as “constrained randomization”, which differs from randomizing by tenant, a “pure randomization”. Therefore, we can’t use a standard two-sample t-test to measure the significance of metric movements. Instead, we have to use a paired t-test: calculate the difference in the metric value between Treatment and Control within each pair, then conduct a one-sample t-test of these differences against zero.
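
For illustration, here is a minimal sketch of that paired analysis with made-up pair-level metric values (the arrays are assumptions):

```python
import numpy as np
from scipy import stats

# Hypothetical pair-level metric values: element i of each array comes from the
# Treatment tenant and the Control tenant of pair i.
treatment = np.array([4.1, 5.3, 2.8, 6.0, 3.9, 4.4])
control   = np.array([3.8, 5.1, 3.0, 5.2, 4.0, 4.1])

# Paired analysis: a one-sample t-test of the within-pair differences against zero...
diffs = treatment - control
t_stat, p_value = stats.ttest_1samp(diffs, popmean=0.0)

# ...which is equivalent to SciPy's paired t-test.
t_rel, p_rel = stats.ttest_rel(treatment, control)
print(t_stat, p_value, t_rel, p_rel)
```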

Problem. We tried this approach in our research; however, we discovered that tenant-pairing consistently fails to provide better assignment balance or metric sensitivity than regular tenant-based variant assignment. The major issue is the data loss introduced by tenant-pairing. Tenant-pairing relies on historical data, so the pairing result only contains tenants that had product usage in the historical period. If new tenants start using the product during the experiment, they are not included in the pairing result and thus won’t be randomized. In addition, if one tenant in a pair does not show up in the experiment, its counterpart has to be excluded from the analysis. In fact, tenant-pairing caused 50+% data loss in the tests we analyzed.

Other limitations. Pairing-based randomization suffers from other limitations. A naïve paired t-test cannot be applied at any sub-randomization-unit level. Pairing also requires extra effort for segmented analysis, which slices data by a specific dimension (e.g., platform or region) to identify treatment effects for sub-groups of the population. For segmented analysis, the two tenants in a pair must belong to the same segment; otherwise, both have to be dropped at the segment level, causing further data loss.

Infrastructure-wise, tenant-pairing requires significant engineering work and maintenance. If tenant-pairing is based on a machine learning algorithm, the model parameters need to be tuned and refreshed regularly as they become stale over time. In addition, enabling tenant-pairing requires changes to the existing experimentation platform, such as supporting the paired t-test in addition to the standard t-test.

Recommendation. Evaluating these factors for real products, we have so far recommended randomized assignment over paired randomization in tenant-randomized A/B tests.

Other considerations

In our research, we applied the Delta method, CUPED VR, and SeedFinder to address the challenges mentioned above. We found that using CUPED VR (in conjunction with the Delta method for user-level metrics) greatly improves metric sensitivity for both tenant-level and user-level metrics. However, the improved metric sensitivity is still below what we can achieve in user-randomized experiments. We therefore expanded the study to approaches that may provide further improvement: truncating count-based metrics (excluding values beyond the 99th percentile, based on 7-day data), excluding 1-user tenants, and combining CUPED VR with these approaches.
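
As an illustration of the truncation step, here is a minimal sketch that caps a heavy-tailed count metric at the 99th percentile computed on a reference window. Capping, as in the “metric capping” mentioned in the conclusion, is one common form of truncation; the exact rule and the toy data here are assumptions.

```python
import numpy as np

def cap_at_p99(values, reference):
    """Cap a count-based metric at the 99th percentile of a reference window
    (e.g. 7 days of data) so the same threshold applies to both variants."""
    return np.minimum(values, np.percentile(reference, 99))

# Toy illustration with a heavy-tailed count metric (hypothetical data).
rng = np.random.default_rng(2)
reference_7day = rng.lognormal(mean=1.0, sigma=1.5, size=10_000)
in_experiment = rng.lognormal(mean=1.0, sigma=1.5, size=10_000)

capped = cap_at_p99(in_experiment, reference_7day)
print(in_experiment.var() / capped.var())  # variance drops substantially after capping
```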

The results showed that the combination of truncation and VR provides the biggest boost in metric sensitivity, for both tenant-level and user-level metrics. Most count-based metrics (e.g., “sessions per user”) show 80+% improvement, and user-level metrics see improvements as large as 97%. Excluding 1-user tenants, however, does not show a consistent effect. For some products, it improves metric sensitivity. For others, the variance of the metrics is mostly driven by large tenants rather than small ones, and removing 1-user tenants actually magnifies the problem.

Conclusion

User-randomized A/B tests have better metric sensitivity than tenant-randomized A/B tests. There are multiple approaches we can use to improve sensitivity for tenant-randomized A/B tests, and of these, the common variance control techniques of metric capping and regression adjustment were observed to be the most effective. However, tenant randomization can still result in limited metric power, and we do not recommend widespread adoption of tenant-randomized A/B tests. They should be used only when necessary, and experimenters should follow best metric design practices and use variance-reduced metrics to be able to detect metric movements. It is also wise to test only one variant at a time, instead of testing multiple variants simultaneously, so that each variant gets as large a sample size as possible for experiment analysis.

 

– Li Jiang, Tong Xia, Jen Townsend, Jing Jin, Harrison Siegel, Vivek Ramamurthy, Wen Qin, Microsoft Experimentation Platform

 

References

[1] W. Machmouchi, “Patterns of Trustworthy Experimentation: Pre-Experiment Stage,” 2020. [Online]. Available: https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/patterns-of-trustworthy-experimentation-pre-experiment-stage/

[2] S. Liu, A. Fabijan, M. Furchtgott, S. Gupta, P. Janowski, W. Qin, and P. Dmitriev, "Enterprise-Level Controlled Experiments at Scale: Challenges and Solutions," 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). DOI: 10.1109/SEAA.2019.00013

[3] M. Saveski, J. Pouget-Abadie, G. Saint-Jacques, W. Duan, S. Ghosh, Y. Xu, and E. Airoldi, "Detecting Network Effects: Randomizing Over Randomized Experiments," KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2017, pages 1027–1035. DOI: 10.1145/3097983.3098192

[4] G. W. Imbens and D. B. Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

[5] Z. Zhao, M. Chen, D. Matheson, and M. Stone, "Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation," 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). DOI: 10.1109/DSAA.2016.61

[6] A. Deng, U. Knoblich, and J. Lu, "Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas," KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, July 2018, pages 233–242. DOI: 10.1145/3219819.3219919

[7] A. Deng, Y. Xu, R. Kohavi, and T. Walker, "Improving the sensitivity of online controlled experiments by utilizing pre-experiment data," WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining, February 2013, pages 123–132. DOI: 10.1145/2433396.2433413