**Experimentation in big companies like ****Adevinta**** can be challenging from a technical point of view. **We receive billions of events daily from our digital marketplaces in 10 different countries, and we need to measure the significance of slight differences in multiple metrics and experiments. In the Houston team, we always evaluate improved ways of performing this calculation.

Results of A/B tests are typically calculated in batch mode. The process usually involves calculating aggregate values at the *user* and *group* levels.

Let’s imagine an experiment trying to improve the number of advertisement views; we are on the 14th day of the experiment. The user-level aggregate values would look like:

And the group-level aggregation would look something like this:

The purpose of these group-level aggregations is to decide later if that difference in the mean is significant (using the Bayesian or frequentist approach).

We also want to track the evolution of those metrics throughout time. Not to make decisions (that would incur the ‘*peeking problem’* of A/B testing), but to be sure we didn’t break anything (after all, we’re trying out a new feature). So we want to regularly examine some guardrails metrics, such as ‘*mean revenue per user*’.

Let’s imagine we are now on the 15th day and need to update the accumulated results from day 14. We calculate the same aggregations for the last 24 hours:

These last 24-hour results cannot be merged with the previously accumulated results because the populations are shared (user *Joe* could be counted twice, even though it’s the same user). Mean and *stdev* aggregated values can only be merged if the populations are separated.

So, most A/B testing platforms usually calculate the results again from the very beginning of every results update. This has an exponential cost: we must repeatedly read and aggregate the same billions of events every day.

One possible solution is maintaining an intermediate state: a table with the user-level aggregations per each experiment. Then, on the 15th day, we would update that user-level table with the new events received in the last 24 hours and re-calculate the group-level aggregations. While this would be perfectly doable, it forces us to develop code to maintain and update an internal state and develop solutions for all possible conditions, like backfills and missed runs.

We could approach the problem differently: what if we had a job that starts and ends with the experiment and has no dependencies on other job runs?

Enter streaming mode.

## Streaming mode (first iteration)

We start by partitioning events by the user identifier so each task will handle a different, distinct set of users. Each node can maintain its user-level aggregations easily, as this is one of the main features of any streaming framework (in our case, we used Kafka streams):

```
exposedUsers.join(adviewEvents, (e,t) -> e.user_id = t.user_id)
.groupBy(userId).count()
```

Then, every x minutes, we can iterate over those user-level aggregations and calculate the overall mean and standard deviation in a single pass using Welford’s algorithm, similar to how it is done in Spark.

However, streaming applications should refrain from blocking by doing synchronous calls or iterating over large data sets, as this might stop data processing momentarily and make the application less predictable and reliable.

Can we do better?

## Streaming mode (second iteration)

Every time an aggregation for a user is updated, we want the overall mean and standard deviation for its group to be updated, too. We can’t use Welford’s algorithm directly because if we aggregated the following sequence of updates from the streaming query above:

We would count two samples for the same real user (adding a user with 14 and later adding a user with a value of 15). Luckily, there’s a weighted version of Welford’s algorithm where you can add samples (*weight = 1*) and remove samples (*weight = **–**1*). By removing the old value and adding the new one, we can effectively update the overall stats for the group without iterating over any data set (*D.H.D. West*).

Finally, we need to introduce a new concept: Numerical stability. It’s well known that CPUs cannot store numbers in their exact value. For example, 2/3 is stored as *0.666666666*…, depending on the precision you’re working with. But there’s always a small loss. Usually, it’s so small that it doesn’t have any effect in real life. For instance, in interplanetary navigation, NASA only uses 15 digits of pi. However, some algorithms can incur ‘*catastrophic cancellation*’ by repeatedly subtracting similar approximate values. This phenomenon is called ‘*numerical instability*’. An example of an algorithm that suffers from numerical instability is the ‘*sum of squares*’ method for estimating the variance: *Var[X] = E[X*X]-E[X]•E[X] :*

```
import random
import math
if __name__ == '__main__':
# Generate samples
random.seed(0)
samples = [random.gauss(5000, .001) for _ in range(1000000)]
# Single-pass method
mean = sum(samples) / len(samples)
variance = sum([x**2 for x in samples]) / len(samples) - mean**2
std_dev_single_pass = math.sqrt(variance)
# Welford's algorithm
mean_welford = 0.0
m2 = 0.0
for i, x in enumerate(samples, start=1):
delta = x - mean_welford
mean_welford += delta / i
delta2 = x - mean_welford
m2 += delta * delta2
variance_welford = m2 / (len(samples) - 1)
std_dev_welford = math.sqrt(variance_welford)
print(f"stdev (single-pass): {std_dev_single_pass}")
print(f"stdev (Welford): {std_dev_welford}")
Variance (single-pass): 0.0005758045124546267
Variance (Welford): 0.000999940233623677
```

Welford’s algorithm is numerically stable, but its weighted version is not. Nevertheless, the numerical stability is recovered by switching the two accumulators of a weighted Welford’s algorithm from 64 to 128 bits.

## Conclusions

At Adevinta, we developed a streaming A/B analysis solution that can scale to support a large number of experiments, each one including millions of distinct users. We only need to look at each event once, regardless of how many experiments or users we have or how many result updates we want. We can also detect any problem in real time when doing feature rollouts without waiting hours for a batch process to complete.

This post discussed calculating aggregated values to analyse an A/B test result. **If you want to know more about how this analysis is done, don’t miss this other ****excellent post****.**