Sample DataFrame
For the first set of benchmarks, I made a dataframe with the following columns:
- counter: ints incrementing from 1
- chars: letters A-Z repeating over and over
- strings: chars + counter
- floats: random numbers between counter - 1 and counter + 1
Finding rows using conditions
A super common operation is when you want to find all rows matching some conditions. For this dataframe that might be "give me rows with counter = '1'". Two common was to do that if the dataframe is stored in the variable df are:
- rows = df[df['counter'] == 1]
- rows = df.loc[df['counter'] == 1]
- rows = df[df['counter'] == 1] method = 16 ms
- rows = df.loc[df['counter'] == 1] method = 16 ms
So no difference; these two are roughly identical. Why use .loc then?
.loc is faster, mainly when searching for a value in an indexed column in the dataframe. Imagine we run the following:
df.set_index(['counter'], inplace=True, drop=False)
Now we get the following three benchmarks for the lookups on counter:
| method | runtime |
|---|---|
| rows = df[df['counter'] == 1] | 16ms |
| rows = df.loc[df['counter'] == 1] | 16 ms |
| rows = df.loc[1] | 0.046 ms (~3000x faster...) |
You might be wondering how .iloc fits in. For .iloc, you specify the row number so it only works when you know that. The other methods here are when you want to find a row based on the value of a column (or values of columns).
Finding rows by row index
We just noted .iloc. Is it the fastest option for getting the value at a specific row and column? Let's compare 3 common methods for that. Imagine we want the value of the 'chars' column from the 100th row (index 99 since 0-indexed):| method | runtime |
|---|---|
| val = df.iloc[99]['chars'] | 0.05 ms |
| val = df.iat[99]['chars'] | 0.004 ms |
| val = df['chars'].values[99] | 0.004 ms |
Finding the mean of a subset of a column
For the rest of the benchmarks, I switched to the first 1 million rows of the dataframe because some of the operations are much slower.Operations across a column like mean or standard deviation are also really common. Two obvious ways to solve that are to iterate through each row and build an array to work with, or use something like .loc. When iterating through rows you'll likely see examples using iterrows and itertuples so I've benchmarked both. Finding the mean of the 'floats' column when the 'chars' column equals 'A' takes:
| method | runtime |
|---|---|
| # iterate with iterrows vals = [] for index, row in df.iterrows(): if row['chars'] == 'A': vals.append(row['floats']) result = np.mean(vals) | 45 seconds |
| # iterate with itertuples vals = [] for row in df.itertuples(): if row.chars == 'A': vals.append(row.floats) result = np.mean(vals) | 1.1 seconds |
| # use .loc but without indexing df.loc[df['chars'] == 'A']['floats'].mean() | 0.083 seconds |
| # use .loc with indexing df.loc[idx[:, 'A'], 'floats'].mean() | 0.0026 seconds |
Doing math on a column or combining columns
You'll probably come across .apply() for this, and it can be slow and in tricky ways. To benchmark this, we'll simply multiply our floats column by 2 using several different techniques:| method | runtime |
|---|---|
| vals = [] for row in test_df.itertuples(): vals.append(row.floats * 2) test_df['new_floats'] = vals | 0.99 s |
| test_df['new_floats'] = test_df.apply(lambda row: row['floats'] * 2, axis=1) | 5.04 s |
| test_df['new_floats'] = test_df['floats'].apply(lambda x: x * 2) | 0.29 s |
| test_df['new_floats'] = df['floats'].map(lambda x: x * 2) | 0.30 s |
| test_df['new_floats'] = df['floats'] * 2 | 0.005 s |
- itertuples is creating a tuple per row
- .apply() with axis = 1 is creating a series object per row so most expensive
- .apply() without the axis = 1 just works with the existing floats columns so is much faster, but still slow because it has to call the lambda once per row
- just doing * 2 is a vectorized operation so has almost no overhead; it simply passes the blob to optimized code that operates on the whole thing
Simple Takeaways
- Use vectorized operations like built-in math operators (*, -, etc.) and .mean(), .sum(), etc. when possible.
- If you're regularly looking up on a field or set of fields in your dataframe, index it by the lookup field(s) and use .loc.
- If you have to loop through your dataframe, user itertuples instead of iterrows.

0 comments:
Post a Comment