In today’s data-driven world, choosing the right tool for processing datasets is crucial. Every platform claims to be the best, but which one truly delivers when the pressure is on?
This article dives deep into a head-to-head comparison of three popular data processing tools: DuckDB, SQLite, and Pandas. We subjected them to everyday analytical tasks on a dataset containing over a million rows. Our goal? To determine which tool offers the best balance of speed and memory efficiency for real-world data analysis.
The Challenge: Real-World Data Analysis
Unlike artificial benchmarks designed to showcase specific strengths, we focused on tasks that data analysts perform daily:
- Summing Values: Calculating the total of a numerical column.
- Grouping by Categories: Aggregating data based on categorical variables.
- Filtering with Conditions: Selecting data based on specific criteria before aggregation.
- Multi-Field Aggregations: Combining multiple grouping and aggregation operations.
We put them through their paces using a dataset large enough to reveal performance differences, but small enough to be workable on a single machine.
Meet the Contenders: DuckDB, SQLite, and Pandas
Let’s take a closer look at the tools in this comparison:
- DuckDB: A high-performance analytical database system designed for speed and efficiency. It integrates seamlessly with Python and Pandas, allowing you to query DataFrames directly using SQL.
- SQLite: A lightweight, file-based database engine, ideal for embedded applications and small to medium-sized datasets. It’s known for its simplicity and ease of use.
- Pandas: A powerful Python library for data manipulation and analysis. It provides flexible data structures like DataFrames and Series, along with a wide range of functions for cleaning, transforming, and analyzing data.
The Million-Row Dataset: Our Battleground
To ensure a realistic test, we used the "Bank dataset" from Kaggle, consisting of over 1 million rows and five columns:
- Date: The date of the transaction.
- Domain: The type of business (e.g., RETAIL, RESTAURANT).
- Location: The geographical region (e.g., Goa, Mathura).
- Value: The transaction value.
- Transaction_count: The number of transactions on that day.
While this dataset is synthetically generated, its size and structure are suitable for highlighting the performance differences between the tools. Here’s how we loaded and peeked at the data using Pandas:
import pandas as pd
df = pd.read_excel('bankdataset.xlsx')
print("Dataset shape:", df.shape)
df.head()
This gave us a quick overview of the data’s dimensions and structure, ensuring everything was loaded correctly.
Setting the Stage: A Fair and Consistent Environment
To guarantee a level playing field, we ran all three tools (DuckDB, SQLite, and Pandas) within the same Jupyter Notebook environment. This ensured consistent runtime conditions and memory usage.
Installation and Setup
First, we installed and imported the necessary Python packages:
!pip install duckdb memory_profiler --quiet
import pandas as pd
import duckdb
import sqlite3
import time
from memory_profiler import memory_usage
Data Preparation
We loaded the dataset once with Pandas, then registered it with DuckDB and copied it into an in-memory SQLite database.
Loading Data into Pandas:
df = pd.read_excel('bankdataset.xlsx')
df.head()
Registering Data with DuckDB:
DuckDB can query Pandas DataFrames directly. We registered the DataFrame as a virtual table (no copy is made), allowing us to query it with SQL:
duckdb.register("bank_data", df)
duckdb.query("SELECT * FROM bank_data LIMIT 5").to_df()
Preparing Data for SQLite:
Since SQLite doesn’t directly read Excel files, we loaded the Pandas DataFrame into an in-memory SQLite database:
conn_sqlite = sqlite3.connect(":memory:")
df.to_sql("bank_data", conn_sqlite, index=False, if_exists='replace')
pd.read_sql_query("SELECT * FROM bank_data LIMIT 5", conn_sqlite)
Benchmarking Methodology
We used the same four analytical queries on each tool, simulating common data analysis tasks.
Ensuring Consistent Setup:
- Pandas queried the DataFrame directly.
- DuckDB executed SQL queries against the registered DataFrame.
- SQLite ran SQL queries on a copy of the DataFrame stored in an in-memory database.
Measuring Execution Time:
We used Python’s time module to measure the duration of each query, excluding data loading and preparation steps.
Tracking Memory Usage:
We used the memory_profiler library to track memory usage before and after each query, estimating incremental RAM consumption.
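To make the methodology concrete, here is a minimal sketch of the measurement pattern repeated for every query below (the run_benchmark helper name is illustrative and not part of the original notebook, where the same steps are written out inline for each query):

def run_benchmark(engine, query_name, fn, results):
    # memory_usage(-1) samples the current process's memory in MiB
    mem_before = memory_usage(-1)[0]
    start = time.time()
    fn()                                    # run the query; the result is discarded
    end = time.time()
    mem_after = memory_usage(-1)[0]
    results.append({
        "engine": engine,
        "query": query_name,
        "time": round(end - start, 4),               # seconds
        "memory": round(mem_after - mem_before, 4)   # MiB delta
    })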
The Benchmark Queries: Putting the Tools to the Test
We ran each tool through the same four essential data analysis tasks:
- Query 1: Total Transaction Value: Summing the ‘Value’ column.
- Query 2: Group by Domain: Aggregating transaction counts per ‘Domain’.
- Query 3: Filter by Location: Filtering rows where ‘Location’ is ‘Goa’ before aggregating.
- Query 4: Group by Domain & Location: Multi-field aggregation calculating the average ‘Value’ grouped by ‘Domain’ and ‘Location’.
The Results: Unveiling the Performance Champions
Let’s dive into the performance of each tool for each query.
Query 1: Total Transaction Value
This query tests the speed of summing a numeric column across the entire dataset.
Pandas Performance:
pandas_results = []

def pandas_q1():
    return df['Value'].sum()

mem_before = memory_usage(-1)[0]  # current process memory (MiB) before the query
start = time.time()
pandas_q1()
end = time.time()
mem_after = memory_usage(-1)[0]   # memory (MiB) after the query

pandas_results.append({
    "engine": "Pandas",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
pandas_results
DuckDB Performance:
duckdb_results = []

def duckdb_q1():
    return duckdb.query("SELECT SUM(value) FROM bank_data").to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
duckdb_results
SQLite Performance:
sqlite_results = []

def sqlite_q1():
    return pd.read_sql_query("SELECT SUM(value) FROM bank_data", conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
sqlite_results
Overall Performance Analysis:
Pandas emerged as the clear winner, completing the query almost instantly with minimal memory usage. DuckDB was slightly slower and used more memory, while SQLite was the slowest and most memory-intensive.
Query 2: Group by Domain
This query measures the performance of grouping transactions by ‘Domain’ and summing their counts.
Pandas Performance:
def pandas_q2():
    return df.groupby('Domain')['Transaction_count'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain"]
DuckDB Performance:
def duckdb_q2():
    return duckdb.query("""
        SELECT domain, SUM(transaction_count)
        FROM bank_data
        GROUP BY domain
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Group by domain"]
SQLite Performance:
def sqlite_q2():
    return pd.read_sql_query("""
        SELECT domain, SUM(transaction_count) AS total_txn
        FROM bank_data
        GROUP BY domain
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Group by domain"]
Overall Performance Analysis:
DuckDB took the lead in this round, followed by Pandas, which traded a bit more time for lower memory consumption. SQLite remained the slowest and most memory-intensive.
Query 3: Filter by Location (Goa)
This query measures the performance of filtering rows based on a condition (‘Location’ = ‘Goa’) and then summing the transaction values.
Pandas Performance:
def pandas_q3():
    return df[df['Location'] == 'Goa']['Value'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Filter by location"]
DuckDB Performance:
def duckdb_q3():
    return duckdb.query("""
        SELECT SUM(value)
        FROM bank_data
        WHERE location = 'Goa'
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Filter by location"]
SQLite Performance:
def sqlite_q3():
    return pd.read_sql_query("""
        SELECT SUM(value) AS total_value
        FROM bank_data
        WHERE location = 'Goa'
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Filter by location"]
Overall Performance Analysis:
DuckDB once again emerged as the fastest and most memory-efficient, while Pandas was slower and required more memory. SQLite was the slowest but lighter on memory than Pandas.
Query 4: Group by Domain & Location
This query tests the performance of multi-field aggregation, calculating the average ‘Value’ grouped by both ‘Domain’ and ‘Location’.
Pandas Performance:
def pandas_q4():
    return df.groupby(['Domain', 'Location'])['Value'].mean()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain & location"]
DuckDB Performance:
def duckdb_q4():
    return duckdb.query("""
        SELECT domain, location, AVG(value) AS avg_value
        FROM bank_data
        GROUP BY domain, location
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in duckdb_results if p["query"] == "Group by domain & location"]
SQLite Performance:
def sqlite_q4():
    return pd.read_sql_query("""
        SELECT domain, location, AVG(value) AS avg_value
        FROM bank_data
        GROUP BY domain, location
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in sqlite_results if p["query"] == "Group by domain & location"]
Overall Performance Analysis:
DuckDB handled this complex query the fastest with moderate memory usage. Pandas was slower and consumed a significantly larger amount of memory, while SQLite was the slowest and had substantial memory consumption.
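With all twelve measurements collected, it helps to line them up side by side before drawing conclusions. Here is a small illustrative snippet; the three result lists come from the code above, while the combined summary table is our own addition rather than part of the original notebook:

summary = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)
summary = summary.sort_values(["query", "time"])  # fastest engine first within each query
print(summary.to_string(index=False))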
The Verdict: Choosing the Right Tool for the Job
So, which tool emerged victorious in this million-row data showdown? While Pandas excelled in simple aggregation, DuckDB consistently delivered the best balance of speed and memory efficiency across most of the queries. SQLite, while easy to use, generally lagged behind in performance.
- For Speed and Efficiency: DuckDB is the clear winner, especially for complex queries.
- For Simplicity and Small Datasets: SQLite remains a viable option.
- For General-Purpose Data Manipulation: Pandas is still a powerful and versatile tool, but be mindful of its memory usage with larger datasets.
The best tool for you will ultimately depend on your specific needs and the nature of your data. However, this benchmark provides valuable insights into the strengths and weaknesses of each tool, helping you make an informed decision for your next data analysis project.
Final Thoughts
In the world of data analysis, choosing the right tool can make a huge difference in productivity and efficiency. By carefully considering the performance characteristics of each option, you can ensure that you’re using the best tool for the job, ultimately leading to faster insights and better outcomes.
