Pandas Performance Optimization Tips
Essential optimization techniques for faster data processing with pandas DataFrame operations.
Pandas Performance Optimization Tips
Working with large datasets in pandas can be challenging when performance becomes a bottleneck. Here are essential optimization techniques that can dramatically improve your data processing speed.
1. Use Vectorized Operations
Instead of iterating through DataFrame rows, leverage pandasβ vectorized operations:
# Slow - iterating through rows
for index, row in df.iterrows():
df.at[index, 'new_col'] = row['col1'] * row['col2']
# Fast - vectorized operation
df['new_col'] = df['col1'] * df['col2']
2. Optimize Data Types
Choose appropriate data types to reduce memory usage:
# Check memory usage
print(df.memory_usage(deep=True))
# Optimize numeric types
df['int_col'] = df['int_col'].astype('int32') # instead of int64
df['float_col'] = df['float_col'].astype('float32') # instead of float64
# Use categories for strings with few unique values
df['category_col'] = df['category_col'].astype('category')
3. Use query() for Filtering
The query()
method can be faster than boolean indexing:
# Standard filtering
result = df[(df['col1'] > 100) & (df['col2'] == 'value')]
# Faster with query
result = df.query('col1 > 100 and col2 == "value"')
4. Leverage eval() for Complex Expressions
For complex mathematical operations, eval()
can provide significant speedups:
# Standard approach
df['result'] = df['a'] + df['b'] * df['c'] - df['d']
# Faster with eval
df['result'] = df.eval('a + b * c - d')
5. Read Data Efficiently
Optimize data reading with appropriate parameters:
# Specify data types upfront
dtypes = {'col1': 'int32', 'col2': 'float32', 'col3': 'category'}
df = pd.read_csv('file.csv', dtype=dtypes)
# Use chunksize for large files
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
process_chunk(chunk)
These techniques can provide 2-10x performance improvements depending on your use case. Always profile your code to identify the actual bottlenecks in your specific workflow.
Alexander Nykolaiszyn
Manager Business Insights at Lennar | Host of Trailblazer Analytics Podcast | 15+ years transforming raw data into strategic business value through BI, automation, and AI integrations.