Experimentation Platform

More Trustworthy A/B Analysis: Less Data Sampling and More Data Reducing

We are all familiar with terabytes and petabytes. But have you heard of zettabytes (one million petabytes) [1]? Worldwide data volume is expected to hit 163 zettabytes by 2025, 10 times the volume in 2017. Your product will contribute to that surge, especially if it is growing rapidly, so you should be prepared to manage the increase in data [2].

The cost of storage and computation will spike as data volume keeps increasing. Your data pipeline could even fail to process the data if the computation required exceeds its capacity. To avoid these issues, you can reduce data volume by collecting only a portion of the data generated. But you need to answer several questions to ensure the data are collected in a trustworthy way: Are you mindful of the impact on A/B test analysis? Do you still have valid and sensitive metrics? Are you confident the A/B analysis is still trustworthy, so that you can make correct ship decisions?

Figure 1: Data flow of A/B analysis

In this blog post, we talk about:

  • how improper data reduction leads to untrustworthy analysis
  • a practical design process for data volume reduction that helps maintain trustworthy A/B analysis
  • 3 recommended data reduction strategies based on our experience working with multiple partner teams within Microsoft

How does improper data reduction lead to untrustworthy analysis?

Figure 2: User and log matrix with absence of sampling. Cells in blue represent the data being collected.

We will use the user and log matrix shown in Figure 2 (with 20 users) as a general setup to discuss data reduction. Each row represents the data of an individual user, and each column represents the data of one event type (e.g., page load time). In the absence of sampling, all events are collected from all users.

Sampling, collecting only a portion of the data generated on clients, is widely used to reduce data volume. You can sample a subset of users, events, or other units depending on your specific requirements. An alternative to sampling is summarization: transforming the raw data into a summary, which retains information from all users rather than a subset.

Figure 3: Simple Random Sampling

Two commonly used sampling techniques are Simple Random Sampling [3] and Quota Sampling [4]. Simple Random Sampling (Figure 3) collects all data from only a subset of users. Quota Sampling (Figure 4) segments the population into mutually exclusive sub-groups and then selects users from each sub-group at a specified rate. In the user-based segmentation example, users A, B, and C belong to one group with a sample rate of 66%, while users D and E belong to another group with a sample rate of 50%. In the event-based segmentation example, one group consisting of event types 1 and 2 is sampled at 10%, while another group consisting of event types 3, 4, and 5 is sampled at 50%.

Figure 4: Two examples of Quota Sampling
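
As a rough illustration (a sketch only; the helper and seed names here are made up for this example), both techniques can be implemented deterministically on the client by hashing the user ID with a fixed seed into a bucket in [0, 1) and comparing it against either a single global sample rate (Simple Random Sampling) or a per-group rate (user-based Quota Sampling):

```python
import hashlib

def bucket(user_id: str, seed: str = "sampling-seed-1") -> float:
    """Deterministically map a user ID to a bucket value in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 2**32

def simple_random_sample(user_id: str, rate: float) -> bool:
    """Simple Random Sampling: every user is kept with the same probability."""
    return bucket(user_id) < rate

def quota_sample(user_id: str, group: str, rates: dict[str, float]) -> bool:
    """Quota Sampling (user-based segmentation): each sub-group has its own rate."""
    return bucket(user_id) < rates[group]

# Example: sample one sub-group at 66% and another at 50%, as in Figure 4.
rates = {"group_1": 0.66, "group_2": 0.50}
print(simple_random_sample("user-A", rate=0.40))
print(quota_sample("user-D", group="group_2", rates=rates))
```

Hashing the user ID (rather than drawing a fresh random number per event) keeps the decision stable, so a sampled user sends all of their data and an unsampled user sends none.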

Both sampling methods can impact A/B test analysis. Simple Random Sampling can greatly hurt the power of metrics [5], especially metrics covering rare events. For example, only new users generate First Run Experience data, so the traffic is low even when data are collected from all users. Applying Simple Random Sampling would further decrease the traffic and quickly make the related metrics under-powered. As a result, those metrics may miss the feature's impact on the behavior of new users. In Quota Sampling, different events are collected at different rates, and this difference can introduce bias into the metrics. Let's assume 80% of the traffic in a product comes from desktop and the rest comes from mobile. You plan to sample desktop at 25% to reduce data volume but keep mobile data at 100% due to its lower traffic. The observed value of the metric "percentage of events from desktop" will be 50%, a biased estimate of the ground truth of 80%. Both Simple Random Sampling and Quota Sampling may also introduce sampling bias, where the collected samples over- or under-represent the population. Analysis based on biased samples can lead to incorrect ship decisions, which impacts user experience and revenue.
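
To make the arithmetic concrete, here is a minimal sketch using the illustrative 80%/20% desktop/mobile split above. It shows how the naive metric is biased under quota sampling, and how weighting each collected event by the inverse of its sample rate recovers the ground truth (one possible adjustment; your metric logic may differ):

```python
# Ground truth: 80,000 desktop events and 20,000 mobile events (80% desktop).
true_desktop, true_mobile = 80_000, 20_000
rate_desktop, rate_mobile = 0.25, 1.0  # quota sampling rates from the example

# Events actually collected under quota sampling.
kept_desktop = true_desktop * rate_desktop   # 20,000
kept_mobile = true_mobile * rate_mobile      # 20,000

# Naive estimate: ignores the different sample rates -> biased (50%).
naive = kept_desktop / (kept_desktop + kept_mobile)

# Weighted estimate: each kept event counts as 1/rate events -> unbiased (80%).
weighted_desktop = kept_desktop / rate_desktop
weighted_mobile = kept_mobile / rate_mobile
adjusted = weighted_desktop / (weighted_desktop + weighted_mobile)

print(f"naive: {naive:.0%}, adjusted: {adjusted:.0%}")  # naive: 50%, adjusted: 80%
```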

Practical design for data volume reduction

We define a data reduction strategy as a plan for how to reduce data volume and cost, and how to define A/B test metrics with STEDI (Sensitivity, Trustworthiness, Efficiency, Debuggability, and Interpretability). Consider the following steps when you design a strategy.

  • Reconcile existing logging for importance and validity. Collect requirements from different teams and reconcile the data by importance. You may not want to reduce data for critical (must-have) events, such as the essential events used to monitor product health. Instead, consider starting with non-critical (good-to-have) events such as feature usage data.
  • Choose between sampling and summarization to reduce data. We discuss summarization in more detail in our first recommended strategy. Skip the remaining steps below (except Assess impact on STEDI of metrics) if you select summarization rather than sampling.
  • Choose the sampling unit. Let's assume the feature exposure is randomized at the user level. We recommend choosing user or enterprise [6] as the sampling unit, which is at or above the randomization unit. If a user can have multiple devices, sampling at the device level will make it hard to correctly evaluate and interpret the results. For example, the usage patterns of users with multiple devices might be unequally represented by data collected from different devices, such as a personal laptop and a business workstation. Although the random assignment mechanism of the A/B test could average out this variation when the sample size is big enough, it unnecessarily introduces another degree of complexity. Therefore, you should inspect the results carefully if the randomization unit and sampling unit don't match.
  • Choose a sampling method. Simple Random Sampling is easy to implement and requires no metric adjustment. In contrast, Quota Sampling is more flexible but may require metric adjustment. Select the method that best satisfies your business requirements while considering the impact on A/B metrics.
  • Estimate the sample rates considering several factors. (1) Overall data growth: Project it onto the size increase of the sampled data and evaluate whether that increase meets your requirements and limitations. (2) Metric power: The sample size should allow a key set of metrics to detect a significant feature impact; see the power-based sketch after this list. Also, if you usually run A/B tests drilling down to a specific subset of data (such as a particular market), do the calculation accordingly. (3) Number of A/B tests and treatment groups in isolation: If metric power requires 10 million users each for the treatment and control groups, then each additional treatment group will need another 10 million users. Further, you will need at least an additional 20 million users to support another A/B test running concurrently. (4) Special groups: You may apply different sample rates to groups with relatively low traffic, such as new users.
  • Test sampling validity. Sampling bias occurs when samples over- or under-represent some characteristics of the A/B test population. Analysis based on biased samples may lead to incorrect ship decisions and impact user experience and revenue. To check for bias, you can perform counterfactual sampling and evaluate the difference between the sampled and unsampled groups. We explain how to do this in the next section.
  • Assess impact on STEDI of metrics. Make sure the metric definition is statistically correct and the metric is sensitive enough to detect a statistically significant feature impact [5]. While designing the metric adjustment logic, consider not only the implementation complexity but also the maintenance and debugging cost.
  • Re-randomize to avoid residual effects. If A/B tests never cover the users in the unsampled group, the user experience and usage patterns of those users will diverge over time from those of users who were exposed to many tests. The sampled set of users in a future test will then no longer be representative of the entire population. This is due to the residual effects of A/B tests described in [7], [8]. To mitigate such effects, we recommend re-randomizing and re-selecting the sample set regularly.
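
For the Metric power factor above, the sketch below shows one way to translate a power calculation into a minimum sample rate. It assumes a mean metric analyzed with a two-sided z-test; the metric standard deviation, minimum detectable effect, traffic, and variant counts are placeholder numbers, not recommendations.

```python
from scipy.stats import norm

def users_per_group(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Required users per variant for a two-sided z-test on a mean metric."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # critical value for the desired power
    return int(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2) + 1

# Placeholder inputs: metric standard deviation and minimum detectable effect.
n = users_per_group(sigma=4.0, delta=0.05)

# Translate the required users per group into a minimum sample rate, given the
# expected eligible traffic and the number of concurrent variants to isolate.
eligible_users = 50_000_000   # assumed traffic of the targeted population
variants = 3                  # e.g., 1 control + 2 treatments
min_sample_rate = min(1.0, variants * n / eligible_users)
print(n, f"{min_sample_rate:.1%}")
```

If you also need to support additional concurrent A/B tests in isolation, multiply the required users accordingly before converting to a rate, as described in factor (3) above.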

3 recommended data reduction strategies based on our learnings

Data reduction strategies can differ quite a bit across product teams to meet their specific requirements. We recommend 3 data reduction strategies based on our learnings within Microsoft, taking into account the practical design steps above.

  • Summarization: Clients send summary data instead of detailed raw data. This strategy comes closest to the ideal scenario of no data reduction, which allows you to get the most insights from analysis.
  • Full Analysis Population: All users send Critical Events, which are the events used to compute the metrics that evaluate the key impact on the product. A fixed ratio of randomly selected users (which we call the Full Analysis Population) sends the complete set of events, with A/B tests and analysis primarily targeting those users.
  • Event Sampling Adjustment: All users send Critical Events. The remaining events are sampled from subsets of users, with sample rates that differ by event type. This strategy requires extra work to adjust metrics for sampling.

Now let’s dig into the details of each strategy.

1. Summarization

How does it work? In this strategy, data are collected from users by summarization rather than sampling. A client updates the summary locally based on user activity and then sends out the summary data periodically, such as at the end of every session. You should think about the following details during the design: (1) Define the logic for summarizing the data. Will the data be aggregated into event counts or the sum of a value? What will the summary data look like in the log? See this guideline [9] for best practices on using histograms for summarization. (2) Determine the frequency of, and conditions for, sending out the data.
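
As a rough sketch of what such client-side summarization could look like (the class and event names are hypothetical; your telemetry client will differ), the client keeps running aggregates and emits one summary record per session instead of one record per raw event:

```python
from collections import defaultdict

class SessionSummarizer:
    """Aggregate raw events locally and emit one summary record per session."""

    def __init__(self):
        self.counts = defaultdict(int)    # event name -> number of occurrences
        self.sums = defaultdict(float)    # event name -> sum of a numeric value

    def record(self, event_name: str, value: float = 0.0) -> None:
        # Update the running summary instead of logging the raw event.
        self.counts[event_name] += 1
        self.sums[event_name] += value

    def flush(self) -> dict:
        # Called at the end of the session (or on another agreed condition);
        # the returned summary is what gets sent to the telemetry pipeline.
        summary = {"counts": dict(self.counts), "sums": dict(self.sums)}
        self.counts.clear()
        self.sums.clear()
        return summary

# Usage: summarize page load times instead of sending every raw event.
s = SessionSummarizer()
s.record("page_load", value=120.0)
s.record("page_load", value=95.0)
print(s.flush())  # {'counts': {'page_load': 2}, 'sums': {'page_load': 215.0}}
```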

What are the pros? The strategy changes the data format but does not lose the information required to compute metrics. Therefore, you only need to adjust metric definitions to consume the new data format, without hurting metric sensitivity.

What are the cons? It requires client code changes and pipeline validation if the client has not enabled summarization yet. To validate the pipeline, compare the summarization pipeline with the original one to ensure data consistency.

Additionally, the Summarization strategy has some limitations. (1) If the summary data contains too much detailed information, the gain from this strategy will be small because the data size decreases only slightly. (2) The impact of data loss is amplified. If one record is lost, we lose the information for the complete set of events contained in that summary. Sending duplicate records might help, but it increases the data size. (3) It may limit advanced analysis. Triggered analysis is a common advanced analysis that drills down to specific usage scenarios to isolate the real feature impact from noise [10]. Assume a search engine team decides to go with this approach, and clients send out the counts of query X and query Y when a session ends. Because the time information of individual events is lost, you cannot generate a triggered analysis based on time-ordered events within the session. In this example, it is hard to drill down to the scenario where a query for Y follows a query for X. You can still use data from all users who issued both queries and get an unbiased analysis, but that analysis will have lower sensitivity because it includes users who queried in the opposite order.

2. Full Analysis Population

How does it work? The strategy categorizes all data into Critical and Non-Critical Events. Critical Events are the events used by Critical Metrics, which evaluate the key feature impact on the product, such as guardrail metrics (e.g., reliability and performance) and the product's success metrics. Critical Events are collected from all users to maintain the sensitivity of Critical Metrics. Non-Critical Events are used by the remaining, lower-importance metrics, such as feature-related metrics and debug metrics.

We define a Full Analysis Population: a randomly selected subset of users who send back both Critical and Non-Critical Events. An A/B test is run first on the Full Analysis Population only (stage 1 in Figure 5). You can perform a complete analysis, with both Critical and Non-Critical Metrics, to get the most insights and debug metric movements. This population should be big enough for most A/B tests to get metrics with acceptable sensitivity. If a larger sample size is needed, an additional A/B test runs on all users (stage 2 in Figure 5). During that analysis, you compute only the Critical Metrics, which are based on Critical Events coming from all users. When using this strategy, make sure you mark every event a Critical Metric depends on as Critical, so that all of those events are collected whether or not the A/B test runs on the Full Analysis Population.

Figure 5: Data sampling and A/B analysis for Full Analysis Population strategy. The data in red boxes are used for Critical Metrics and the data in yellow box are used for Non-Critical Metrics.

The Full Analysis Population should be chosen at random and be representative of the entire user population. You can use an A/B test as a counterfactual sampling check for sampling bias: treat the users in the Full Analysis Population as the control group and the rest as the treatment group, then analyze whether any difference exists between the groups. Periodically, reset the randomization and re-select the Full Analysis Population to mitigate any carry-over differences between it and the rest of the users.
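
One possible way to run this counterfactual check is a standard two-sample t-test between the Full Analysis Population and the remaining users on metrics that should not differ between the groups. This is a sketch with simulated per-user values; in practice you would use the observed values of a few such metrics.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical per-user metric values (e.g., sessions per user) for the
# Full Analysis Population and for the rest of the users.
full_analysis_population = rng.poisson(lam=5.0, size=100_000)
remaining_users = rng.poisson(lam=5.0, size=900_000)

stat, p_value = ttest_ind(full_analysis_population, remaining_users, equal_var=False)

# With unbiased sampling we expect no difference; repeated small p-values
# across metrics suggest the sampled population is not representative.
print(f"t = {stat:.2f}, p = {p_value:.3f}")
```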

What are the pros? This strategy essentially creates the Full Analysis Population by Simple Random Sampling (users A and B in Figure 5). You run A/B tests on this population just as you would if no data were sampled. Therefore, there is no need to change metric definitions at all.

What are the cons? Only one sample rate (the one used to select the Full Analysis Population) applies to all Non-Critical Event types, which impacts sensitivity uniformly across metrics. For example, page load time events usually come in at high volume, so feature teams often wish to sample them at a lower rate to maximize data cost savings, whereas other events further down the usage funnel (e.g., ads click events) fire less frequently and should be sampled at a higher rate. This strategy cannot meet that requirement.

If the Full Analysis Population is small, metric power suffers. In that case, you can proceed to stage 2, with the analysis limited to Critical Events only. Due to the increase in traffic, the Critical Metrics will have better sensitivity, but nothing can be done to improve the Non-Critical Metrics.

Refer to the Event Sampling Adjustment strategy below, which addresses the concerns about flexibility and metric sensitivity in this strategy.

3. Event Sampling Adjustment

How does it work? As in the Full Analysis Population strategy, all users send Critical Events so that Critical Metrics always have the highest sensitivity. But this strategy does not need a Full Analysis Population. Non-Critical Events are collected from subsets of users; the sample rates can differ by event type but should always use the same randomization seed. You can see an example in Figure 6. Non-Critical Event type 3 is sampled from 15% of users, and event types 4 and 5 are sampled from 10%. The 10% of the population (users A and B) is contained in the 15%, so those users send all of these events. As mentioned in the previous section, it is necessary to check for sampling bias at each sample rate used.

Figure 6: Data sampling and A/B analysis for Event Sampling Adjustment strategy. The data in red box are used for Critical Metrics and the data in yellow boxes are used for Non-Critical Metrics.

You can use an adaptive user base when computing Non-Critical Metrics. The logic is to look up the sample rates of all underlying events and drill down to the largest eligible set of users who send all of those events, as sketched below. In Figure 6, if a metric depends on event types 1, 2, and 3, an unbiased estimate comes from analyzing data from users A, B, and C. If a metric covers event types 4 and 5, the user base is adjusted to users A and B.
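
Here is a minimal sketch of that adaptive logic (the data structures are hypothetical). Because all Non-Critical Events share the same randomization seed, the sampled user buckets are nested, so the eligible user base for a metric is determined by the smallest sample rate among its underlying event types:

```python
# Sample rates per event type; Critical Events (1, 2) are collected from everyone.
event_sample_rate = {1: 1.0, 2: 1.0, 3: 0.15, 4: 0.10, 5: 0.10}

def eligible_rate(metric_events: list[int]) -> float:
    """Fraction of users who send *all* events the metric depends on.

    Because every event type is sampled with the same randomization seed,
    the buckets are nested: users below the 10% cutoff are also below the
    15% cutoff, so the eligible set is driven by the smallest rate."""
    return min(event_sample_rate[e] for e in metric_events)

def in_user_base(user_bucket: float, metric_events: list[int]) -> bool:
    """user_bucket is the user's deterministic hash bucket in [0, 1)."""
    return user_bucket < eligible_rate(metric_events)

print(eligible_rate([1, 2, 3]))   # 0.15 -> analyze users A, B, C in Figure 6
print(eligible_rate([4, 5]))      # 0.10 -> analyze users A, B in Figure 6
```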

What are the pros? The strategy allows Non-Critical Events to be sampled at different rates, which is more flexible than the Full Analysis Population strategy. The adaptive user base for Non-Critical Metrics maximizes metric power by making full use of the collected data. And within the adapted user base, the data used to compute each metric are complete, so you don't need to further adjust metric definitions.

What are the cons? The logic to adjust the user base in metric definitions might be complicated, and we don't recommend hard-coding the event sample rates in metric definitions, because that increases maintenance effort when the sample rates change. Regarding analysis, different user bases across metrics can make results hard to debug and interpret. For example, it is not straightforward to tell what the largest eligible set of users is for a metric using multiple events, and different metrics will have different eligible sets of users, which makes it harder to correlate and debug metric movements.

Key takeaways

Figure 7 summarizes how the data and A/B analysis look for each of the three recommended strategies.

Figure 7: Data sampling and A/B analysis for the three recommended data reduction strategies. The data in red boxes are used for Critical Metrics and the data in yellow boxes are used for Non-Critical Metrics.

In Table 1, we put together the pros and cons of the recommended strategies, ranked with the best one on top. When you look for a pre-defined data reduction strategy, start with the top one and move down only if you hit a limitation.

Table 1: Pros and cons of the three recommended data reduction strategies, ranked with the best one on top.

Strategy: Summarization
– How it works: Collect summary data instead of detailed raw data.
– Pros: Little information loss. No impact on metric sensitivity.
– Cons: Need to change client code, validate the pipeline, and create metrics that consume the new data format. The data size may still be large. The impact of data loss can be great. May limit advanced analysis.

Strategy: Full Analysis Population
– How it works: Collect Critical Events from all users and Non-Critical Events from the Full Analysis Population. Re-randomize and re-select the Full Analysis Population regularly. At first, run all A/B tests targeting the Full Analysis Population and perform a complete analysis. If the sample size is not large enough, run A/B tests targeting all users instead, and analyze only Critical Metrics for a safe rollout.
– Pros: No information loss for Critical Metrics. No need to adjust metric definitions.
– Cons: Can greatly impact metric sensitivity if the Full Analysis Population is small. Non-Critical Events are sampled at a fixed rate. When the second stage is needed, it takes time to run the additional A/B test.

Strategy: Event Sampling Adjustment
– How it works: Collect Critical Events from all users. Collect Non-Critical Events from subsets of users, with various sample rates. Re-randomize regularly. During analysis, adapt the user base for metrics to make maximum use of the collected data.
– Pros: No information loss for Critical Metrics. Allows Non-Critical Events to be sampled at different rates.
– Cons: The logic to adjust the user base in metric definitions might be complicated and costly to maintain. Can be hard to debug and interpret analysis results.

Summary

Data reduction may impact the trustworthiness of A/B analysis. Therefore, if you need to reduce data volume, it is important to understand the impact on A/B analysis results. In this blog post, we (1) discussed the steps to set yourself up for success when designing data reduction, and (2) presented 3 recommended data reduction strategies along with their pros and cons. We hope this helps you design a practical data reduction strategy that meets your product requirements while keeping A/B analysis trustworthy.

 

– Wen Qin, Somit Gupta and Jing Jin, Microsoft Experimentation Platform

 

References

[1] “Big Data Growth Statistics to Blow Your Mind (or, What is a Yottabyte Anyway?).” https://www.aparavi.com/data-growth-statistics-blow-your-mind/.

[2] “Enormous Growth in Data is Coming — How to Prepare for It, and Prosper From It,” https://blog.seagate.com/business/enormous-growth-in-data-is-coming-how-to-prepare-for-it-and-prosper-from-it/.

[3] “Simple random sample.” https://en.wikipedia.org/wiki/Simple_random_sample.

[4] “Quota sampling.” https://en.wikipedia.org/wiki/Quota_sampling.

[5] W. Qin, W. Machmouchi, and A. M. Martins, “Beyond Power Analysis: Metric Sensitivity in A/B Tests.” https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/beyond-power-analysis-metric-sensitivity-in-a-b-tests/.

[6] J. Li et al., “Why Tenant-Randomized A/B Test is Challenging and Tenant-Pairing May Not Work.” https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/why-tenant-randomized-a-b-test-is-challenging-and-tenant-pairing-may-not-work/.

[7] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne, “Controlled experiments on the web: survey and practical guide,” Data Mining and Knowledge Discovery, vol. 18, no. 1, pp. 140–181, Feb. 2009, doi: 10.1007/s10618-008-0114-1.

[8] R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, and Y. Xu, “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained,” KDD, 2012.

[9] “Histogram Guidelines.” https://chromium.googlesource.com/chromium/src.git/+/HEAD/tools/metrics/histograms/README.md#Enum-Histograms.

[10] N. Chen, M. Liu, and Y. Xu, “Five Insightful Discoveries Towards Building Intelligent A/B Testing Platforms,” KDD, 2017.