Monday, May 25, 2026

Published May 25, 2026 by with 0 comment

Benchmarking Common Pandas DataFrame Operations

Pandas DataFrames offer a ton of ways to do the same things and it isn't always obvious to me which option is best. I went ahead and profiled some common situations to compare them.

Sample DataFrame

For the first set of benchmarks, I made a dataframe with the following columns:

  • counter: ints incrementing from 1
  • chars: letters A-Z repeating over and over
  • strings: chars + counter
  • floats: random numbers between counter - 1 and counter + 1
First 5 rows look like this:


The dataframe for the first set of benchmarks has 10 million rows.


Finding rows using conditions

A super common operation is when you want to find all rows matching some conditions. For this dataframe that might be "give me rows with counter = '1'". Two common was to do that if the dataframe is stored in the variable df are:

  • rows = df[df['counter'] == 1]
  • rows = df.loc[df['counter'] == 1]
Which should you use? For the sample dataframe, the time to do 5 row lookups in my test environment (google colab) is:
  • rows = df[df['counter'] == 1] method = 16 ms
  • rows = df.loc[df['counter'] == 1] method = 16 ms

So no difference; these two are roughly identical. Why use .loc then?

.loc is faster, mainly when searching for a value in an indexed column in the dataframe. Imagine we run the following:

df.set_index(['counter'], inplace=True, drop=False)
Now we get the following three benchmarks for the lookups on counter:
methodruntime
rows = df[df['counter'] == 1]16ms
rows = df.loc[df['counter'] == 1]16 ms
rows = df.loc[1]0.046 ms (~3000x faster...)
The indexing only took 0.5 ms, so it can be well worth it. If you aren't familiar with that syntax, the .loc[1] there means 'find when the value of the index is 1' so it is still looking up by value and not location.

You might be wondering how .iloc fits in. For .iloc, you specify the row number so it only works when you know that. The other methods here are when you want to find a row based on the value of a column (or values of columns).

Finding rows by row index

We just noted .iloc. Is it the fastest option for getting the value at a specific row and column? Let's compare 3 common methods for that. Imagine we want the value of the 'chars' column from the 100th row (index 99 since 0-indexed):
methodruntime
val = df.iloc[99]['chars']0.05 ms
val = df.iat[99]['chars']0.004 ms
val = df['chars'].values[99]0.004 ms
.iloc is great when you want an entire row, but iat is a good, safe option when you want a specific value, and just indexing .values also works.

Finding the mean of a subset of a column

For the rest of the benchmarks, I switched to the first 1 million rows of the dataframe because some of the operations are much slower.

Operations across a column like mean or standard deviation are also really common. Two obvious ways to solve that are to iterate through each row and build an array to work with, or use something like .loc. When iterating through rows you'll likely see examples using iterrows and itertuples so I've benchmarked both. Finding the mean of the 'floats' column when the 'chars' column equals 'A' takes:
methodruntime
# iterate with iterrows
vals = []
for index, row in df.iterrows():
if row['chars'] == 'A':
vals.append(row['floats'])
result = np.mean(vals)
45 seconds
# iterate with itertuples
vals = []
for row in df.itertuples():
if row.chars == 'A':
vals.append(row.floats)
result = np.mean(vals)
1.1 seconds
# use .loc but without indexing
df.loc[df['chars'] == 'A']['floats'].mean()
0.083 seconds
# use .loc with indexing
df.loc[idx[:, 'A'], 'floats'].mean()
0.0026 seconds
As expected, iterating through the rows in with normal python for loops is much slower than using the dataframe operations, and indexing is much faster than not. However, I was surprised that itertuples is that much faster than iterrows. Iterrows creates a series object per row while itertuples creates a tuple per row, so I knew itertuples was faster, but would have guessed ~3x here instead of the actual ~40x.

Doing math on a column or combining columns

You'll probably come across .apply() for this, and it can be slow and in tricky ways. To benchmark this, we'll simply multiply our floats column by 2 using several different techniques:
methodruntime
vals = []
for row in test_df.itertuples():
vals.append(row.floats * 2)
test_df['new_floats'] = vals
0.99 s
test_df['new_floats'] = test_df.apply(lambda row: row['floats'] * 2, axis=1)5.04 s
test_df['new_floats'] = test_df['floats'].apply(lambda x: x * 2)0.29 s
test_df['new_floats'] = df['floats'].map(lambda x: x * 2)0.30 s
test_df['new_floats'] = df['floats'] * 20.005 s
To explain what's going on there:
  • itertuples is creating a tuple per row
  • .apply() with axis = 1 is creating a series object per row so most expensive
  • .apply() without the axis = 1 just works with the existing floats columns so is much faster, but still slow because it has to call the lambda once per row
  • just doing * 2 is a vectorized operation so has almost no overhead; it simply passes the blob to optimized code that operates on the whole thing


Simple Takeaways

  1. Use vectorized operations like built-in math operators (*, -, etc.) and .mean(), .sum(), etc. when possible.
  2. If you're regularly looking up on a field or set of fields in your dataframe, index it by the lookup field(s) and use .loc.
  3. If you have to loop through your dataframe, user itertuples instead of iterrows.
      edit

0 comments:

Post a Comment