Saturday, October 2, 2021


Can You Confirm Performance Improvements With Noisy Software Benchmarks?

Say you run 20 tests before and after a code change meant to speed up the code, but there's a lot of noise in your benchmarks. Some simple statistical tests can help you determine if you actually have an improvement in that noise.

Sample Data

Imagine your 20 runs before and after look like this:

Before (ms)    After (ms)
241            272
224            211
202            226
243            234
246            205
229            279
209            208
231            212
258            218
287            198
270            215
262            244
227            215
200            175
291            220
290            218
184            218
319            247
250            245
229            199

In case you prefer histograms:

[Histograms of the before and after benchmark times were shown here.]
The 'after' numbers look like they might be smaller. If you take the averages you get 245 ms before and 223 ms after. Is that really an improvement, though, or are you just seeing noise?
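If you want to follow along in Python, here is a minimal sketch that loads the sample data from the table into plain lists and computes the means (the before/after list names are just my own labels, reused in the snippets below):

    from statistics import mean

    # Benchmark times in milliseconds, copied from the table above.
    before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
              270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
    after = [272, 211, 226, 234, 205, 279, 208, 212, 218, 198,
             215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

    print(f"mean before: {mean(before):.0f} ms")  # ~245 ms
    print(f"mean after:  {mean(after):.0f} ms")   # ~223 ms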

T-Test

Assuming your benchmarking noise is roughly normally distributed, you can use a T-test. If you have never seen one, a really rough description is that it takes two groups of numbers and tells you whether the means of the two groups are significantly different (i.e., whether the difference between them is probably more than just noise).

What does 'probably' mean here? A T-test gives you a p value, which is the probability of seeing a difference at least this large if there were no real change, i.e., if it were all noise. For example, a p value of 0.05 would mean roughly 'there's only a 5% chance that noise alone would produce a ~20 ms difference like this'.

You can do this in Excel, Google Sheets, any of the many websites that do it, etc. I tend to use Python for this sort of thing, so a simple overview of how to do it in Python (sketched in code below the list) is:
  • import stats from scipy
  • call the ttest_ind function with the before numbers as the first argument and the after numbers as the second
  • check that the returned t value is positive (since before should be higher than after) and compare the returned p value against 2× your target probability (it is a two-sided p value)
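A minimal sketch of that flow, with the before/after lists repeated so it runs on its own:

    from scipy import stats

    before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
              270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
    after = [272, 211, 226, 234, 205, 279, 208, 212, 218, 198,
             215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

    # ttest_ind returns the t statistic and a two-sided p value.
    t_stat, p_two_sided = stats.ttest_ind(before, after)

    print(f"t = {t_stat:.2f}")  # positive, since 'before' is slower on average
    print(f"p (two-sided) = {p_two_sided:.3f}")
    print(f"p (one-sided) = {p_two_sided / 2:.3f}")  # halve it for the before > after question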
For the numbers in the example here, I get a two-sided p value of 0.03, which is already below the common target of 0.05. Since we only care about the one-sided question (before > after), we can halve it, giving an effective p value of 0.015 (a 1.5% chance this is just noise), which would generally be called a 'significant difference'. Note that 'significant' here doesn't mean important, just unlikely to be noise. The difference in means is still the primary metric here.

To summarize, you could say that the change significantly reduced the benchmark time, with a difference in means of ~20 ms (a ~10% performance improvement).

Why divide by 2?

This is an artefact of the method used. The call above runs a two-sided test (i.e., it tests for both before > after and before < after), but we only care about the before > after side, so we halve the p value. Current versions of SciPy can run the one-sided test for you directly, but I have an older version installed and wanted to show the more generic approach.
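If you do have a newer SciPy (1.6 or later, if I recall correctly, where ttest_ind grew an alternative argument), the one-sided version looks something like this sketch:

    from scipy import stats

    before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
              270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
    after = [272, 211, 226, 234, 205, 279, 208, 212, 218, 198,
             215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

    # alternative='greater' tests the one-sided hypothesis that the mean of the
    # first sample (before) is greater than the mean of the second (after),
    # so no halving of the p value is needed.
    t_stat, p_one_sided = stats.ttest_ind(before, after, alternative='greater')
    print(f"p (one-sided) = {p_one_sided:.3f}")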

Why ttest_ind?

There are a lot of T-test variants you can run. They're worth reading about, but I won't rewrite tons of info on them here. The ttest_ind I used is for independent samples of data. You might argue that a paired test is better here, since 'making code faster' is sort of changing one aspect of a thing and testing it again (a paired sketch follows below for comparison), but ttest_ind works well in general usage.
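If you did want to try the paired variant, SciPy's ttest_rel is the corresponding call. Note that it pairs the i-th before run with the i-th after run, which is a fairly arbitrary pairing for independent benchmark runs; this is just a sketch for comparison, not a recommendation:

    from scipy import stats

    before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
              270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
    after = [272, 211, 226, 234, 205, 279, 208, 212, 218, 198,
             215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

    # ttest_rel is the paired variant: it tests whether the per-pair
    # differences (before[i] - after[i]) have a mean significantly
    # different from zero.
    t_stat, p_two_sided = stats.ttest_rel(before, after)
    print(f"p (two-sided, paired) = {p_two_sided:.3f}")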

Mann-Whitney

What if you have outliers and/or your benchmark noise isn't normally distributed? For a concrete example, what if the first number in the 'after' column were 600 instead of 272? T-tests aren't reliable in these situations: running one blindly returns a p of 0.4, which would indicate no significant difference, all from that single bad outlier (sketched below).
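As a quick sketch of that effect, here is the same ttest_ind call with the first 'after' value swapped for the hypothetical 600 ms outlier (the after_with_outlier name is just mine):

    from scipy import stats

    before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
              270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
    after_with_outlier = [600, 211, 226, 234, 205, 279, 208, 212, 218, 198,
                          215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

    # The single 600 ms outlier inflates the 'after' variance enough that the
    # t-test no longer reports a significant difference in means.
    t_stat, p_two_sided = stats.ttest_ind(before, after_with_outlier)
    print(f"p (two-sided) = {p_two_sided:.3f}")
    print(f"p (one-sided) = {p_two_sided / 2:.3f}")  # roughly the 0.4 mentioned above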

You could automatically exclude the best and worst n times, or manually scrub the data. That takes extra work and judgment calls, though, and we want to automate things. Another option is to use a different type of test, and one that's useful here is the Mann-Whitney U test.

The results look similar to a T-test's, but the test itself is checking something slightly different. Roughly, it tells you how likely it is that a random value chosen from 'after' is just as likely to be greater than a random value chosen from 'before' as vice versa. Since it only cares about the ordering of the values, not their magnitudes, it copes fine with outliers and non-normally distributed data.

Same basic flow in Python (sketched in code below the list):
  • import stats from scipy
  • call the mannwhitneyu function with the before numbers as the first argument and the after numbers as the second; also pass alternative='two-sided' to be consistent with the T-test above if you want
  • compare the returned p value against 2× your target probability (again, it is a two-sided p value)
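A minimal sketch, run against the data with the hypothetical 600 ms outlier so you can see it hold up (again, after_with_outlier is just my own name for that list):

    from scipy import stats

    before = [241, 224, 202, 243, 246, 229, 209, 231, 258, 287,
              270, 262, 227, 200, 291, 290, 184, 319, 250, 229]
    after_with_outlier = [600, 211, 226, 234, 205, 279, 208, 212, 218, 198,
                          215, 244, 215, 175, 220, 218, 218, 247, 245, 199]

    # Mann-Whitney only looks at how the values rank against each other, so the
    # single huge outlier barely moves the result.
    u_stat, p_two_sided = stats.mannwhitneyu(before, after_with_outlier,
                                             alternative='two-sided')
    print(f"p (two-sided) = {p_two_sided:.3f}")
    print(f"p (one-sided) = {p_two_sided / 2:.3f}")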
With these numbers (including the 600 ms outlier), I get a two-sided p value of 0.04, so 0.02 after dividing by 2. This test was not tripped up by the outlier.
