## Saturday, October 2, 2021

Published October 02, 2021

# Can You Confirm Performance Improvements With Noisy Software Benchmarks?

Say you run 20 tests before and after a code change meant to speed up the code, but there's a lot of noise in your benchmarks. Some simple statistical tests can help you determine if you actually have an improvement in that noise.

#### Sample Data

Imagine your 20 runs before and after look like this:

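| Before (ms) | After (ms) |
| --- | --- |
| 241 | 272 |
| 224 | 211 |
| 202 | 226 |
| 243 | 234 |
| 246 | 205 |
| 229 | 279 |
| 209 | 208 |
| 231 | 212 |
| 258 | 218 |
| 287 | 198 |
| 270 | 215 |
| 262 | 244 |
| 227 | 215 |
| 200 | 175 |
| 291 | 220 |
| 290 | 218 |
| 184 | 218 |
| 319 | 247 |
| 250 | 245 |
| 229 | 199 |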

In case you prefer histograms: [Figure: histograms of the before and after timings]

The 'after' numbers look like they're maybe smaller. If you take the average you get 245 ms before and 223 ms after. Is that really better though or are you just seeing noise?
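As a quick check of those averages, here is a minimal sketch using only the Python standard library. The variable names (`before`, `after`, `mean_before`, `mean_after`) are illustrative and not from any particular tool.

```python
from statistics import mean

# Timings in ms, copied from the table above
before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
          270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
after = [272, 211, 226, 234, 205, 279, 208, 212, 218, 198,
         215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

mean_before = mean(before)
mean_after = mean(after)

# Round to whole milliseconds for display, as in the text
print(f"before: {mean_before:.0f} ms, after: {mean_after:.0f} ms")
print(f"difference in means: {mean_before - mean_after:.0f} ms")
```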

#### T-Test

Assuming your benchmarking noise is roughly normally distributed, you can use a T-Test. If you have never seen a T-Test, a really rough description is that it takes two groups of numbers and tells you if the means of the two groups are significantly different (i.e., the difference between them probably isn't just noise).

What does 'probably' mean here? A T-Test gives you a p value: the probability of seeing a difference at least this large if the two groups really had the same mean. E.g., a p value of 0.05 would mean roughly 'if the change did nothing, noise alone would produce a difference of ~20 ms or more about 5% of the time'.

You can do this in Excel, Google Sheets, any of the many websites that do it, etc. I tend to use Python for this sort of stuff, so a simple overview of how to do it in Python is:
• import stats from scipy
• call the ttest_ind function in it with the before numbers as the first arg and the after numbers as the second
• the t value returned should be positive (since before should be higher than after), and the p value returned is two-sided, i.e., twice the one-sided probability we actually care about
For the numbers in the example here, I get a p value of 0.03, which is less than the common target of 0.05. Recall that this is twice the one-sided probability, so it is effectively a probability of 1.5% (p value of 0.015), which would generally mean 'significant difference'. Note that 'significant' here doesn't mean important, just unlikely to be noise. The difference in means is still the primary metric here.
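The steps above can be sketched as follows, assuming SciPy is installed. The variable names are illustrative, and the halving step is only appropriate because the t value comes out positive for this data.

```python
from scipy import stats

# Timings in ms, copied from the sample data
before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
          270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
after = [272, 211, 226, 234, 205, 279, 208, 212, 218, 198,
         215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

# Independent two-sample T-Test; by default the p value is two-sided
t_stat, p_two_sided = stats.ttest_ind(before, after)

# We only care about "before > after", so halve the two-sided p value.
# This shortcut assumes t_stat is positive (before mean above after mean).
p_one_sided = p_two_sided / 2

print(f"t = {t_stat:.2f}")
print(f"two-sided p = {p_two_sided:.3f}, one-sided p = {p_one_sided:.3f}")
```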

To summarize, you could say that the update significantly reduced the benchmark time, and the difference in means is ~20 ms (a ~9% performance improvement).

Why divide by 2?

This is an artefact of the method you use. The call I gave runs a two-sided test, i.e., it tests both before > after and before < after. We only care about the before > after side, so we halve the p value (which is only valid when the t value has the sign we expected). Current versions of the method can handle this for you, but I have an older version installed and wanted to show the more generic approach.
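A small helper can make the halving safe even when the t value has the "wrong" sign. This is a sketch only: `one_sided_p_greater` is a hypothetical name, not a SciPy function. Recent SciPy releases also accept an `alternative` argument on `ttest_ind`, which is not used here so the sketch also runs on older installs.

```python
from scipy import stats

def one_sided_p_greater(first, second):
    """One-sided p value for 'mean(first) > mean(second)',
    derived from the two-sided result of ttest_ind."""
    t_stat, p_two_sided = stats.ttest_ind(first, second)
    if t_stat > 0:
        # Difference is in the expected direction: take half
        return p_two_sided / 2
    # Difference is in the opposite direction: take the other tail
    return 1 - p_two_sided / 2

before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
          270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
after = [272, 211, 226, 234, 205, 279, 208, 212, 218, 198,
         215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

p_improved = one_sided_p_greater(before, after)   # "before is slower"
p_regressed = one_sided_p_greater(after, before)  # "after is slower"
print(f"p(before > after) = {p_improved:.3f}")
print(f"p(after > before) = {p_regressed:.3f}")
```

Swapping the arguments flips the sign of the t value, so the two one-sided p values add up to 1.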

Why ttest_ind?

There are a lot of variants of T-Tests you can run. It's worth reading through them, but I won't rewrite tons of info on them here. The ttest_ind I used is for independent samples of data. You might argue that a paired one is better here since 'making code faster' is sort of changing one aspect of a thing and testing it again, but a paired test needs each before run to line up with a specific after run, which separate benchmark runs usually don't. ttest_ind works well in general usage.

#### Mann-Whitney

What if you have outliers and/or do not have a normal distribution of noise in your benchmarks? For a concrete example, what if the first number in the 'after' column is 600 instead of 272? T-Tests are not reliable in these situations. Running one blindly returns a one-sided p of about 0.4, which would indicate not significantly different, all from that single bad outlier.
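To see the effect of that single outlier, here is a minimal sketch, again assuming SciPy is installed. The name `after_with_outlier` is illustrative; it is the 'after' column with 600 in place of 272.

```python
from scipy import stats

before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
          270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
# Same 'after' data, but the first run is replaced by a 600 ms outlier
after_with_outlier = [600, 211, 226, 234, 205, 279, 208, 212, 218, 198,
                      215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

t_stat, p_two_sided = stats.ttest_ind(before, after_with_outlier)
p_one_sided = p_two_sided / 2  # t_stat is still positive for this data

print(f"t = {t_stat:.2f}")
print(f"two-sided p = {p_two_sided:.2f}, one-sided p = {p_one_sided:.2f}")
```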

You can auto-exclude best and worst n times. You can manually scrub data. That sounds really manual though and we want to automate things. You can also use another type of test. One that's useful here is the Mann-Whitney U test.

The results are similar to a T-Test, but the test itself is looking for something slightly different. Roughly, it asks whether a random value chosen from after is just as likely to be greater than a random value chosen from before as vice versa; a small p value means the data would be surprising if that were true. Since it only looks at the order (ranks) of the values and not their magnitudes, it copes well with outliers and non-normally distributed data.

Same basic flow in Python:
• import stats from scipy
• call the mannwhitneyu function in it with the before numbers as the first arg and the after numbers as the second; also pass in 'two-sided' as the alternative to be consistent with the T-Test above if you want
• the p value returned is then twice the one-sided probability we care about
With the outlier included (600 in place of 272), I get a p value of 0.04, so dividing by 2, 0.02. This test was not tripped up by the outlier.
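The same flow as a minimal sketch, assuming SciPy is installed and using the outlier version of the data. Variable names are illustrative, and the value of the U statistic is not interpreted here because which of the two U values is returned has differed between SciPy versions.

```python
from scipy import stats

before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
          270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
# 'after' data with the 600 ms outlier in place of 272
after_with_outlier = [600, 211, 226, 234, 205, 279, 208, 212, 218, 198,
                      215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

# Rank-based test; ask for the two-sided p value explicitly
u_stat, p_two_sided = stats.mannwhitneyu(
    before, after_with_outlier, alternative="two-sided"
)

# Halve for the one-sided question "does before tend to be larger?".
# As with the T-Test, this assumes the difference is in that direction.
p_one_sided = p_two_sided / 2

print(f"two-sided p = {p_two_sided:.3f}, one-sided p = {p_one_sided:.3f}")
```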