In this post, we will analyze a relatively small parquet file using various tools. On Microsoft Fabric, we will use basic Pandas, Pandas with multiprocessing, Polars, and DuckDB. We will also use cuDF.pandas on Google Colab. Below is the result.
These are in-memory analytic tools. Pandas is single-threaded, while Polars and DuckDB are multi-threaded and written in Rust and C++, respectively.

Pandas with multiprocessing did surprisingly well. Implementing DuckDB was the hardest as looping is hard unless using SQL Server.
The file that we work with is a table of mortgage applications with 1.6 million rows. As this table is relatively small, using cuDF was ineffective on Google Colab A100 so we do not include it here. Perhaps in another post, I will test all tools again on a much larger file.
We filter the applicant_income_000s >= k for each k within 0 and 10000, and we compute the mean of the other two columns. We use the magic comand %%timeit to find the average time.


GRAPH
fig,ax1=plt.subplots()
ax1.plot(x_values,rates, label='pull through', color='green')
ax1.set_xlabel('Income amount in 000s')
ax1.set_ylabel('Acceptance rate')
plt.legend()
ax2=ax1.twinx()
ax2.plot(x_values,loan_avgs, label='average loan', color='purple')
ax2.set_ylabel('Average Loan amount in 000s')
plt.legend(loc='lower left')
plt.show()

STANDARD PANDAS
def pull_through_rate_panda(k, df):
averages = df[df["applicant_income_000s"]>=k][["pull_through","loan_amount_000s"]].mean()
return averages
averages = [pull_through_rate_panda(k,data) for k in range(10000)]
PANDAS WITH MULTIPROCESSING
pool = Pool()
x_values = range(0,10000,1)
results=pool.map(pull_through_rate_panda,x_values)
pool.close()
pool.join()
POLARS
def pull_through_rate(k):
averages = data.filter(pl.col('applicant_income_000s') >= k).select([
pl.col('pull_through'),
pl.col('loan_amount_000s')
]).mean()
return averages['pull_through'][0],averages['loan_amount_000s'][0]
DUCKDB
query = '''
WITH RECURSIVE k_values(k) AS (
SELECT 0
UNION ALL
SELECT k+1 from k_values
WHERE k < 9999
),
averages AS (
SELECT
k,
AVG(CASE WHEN t.applicant_income_000s >= k THEN t.pull_through ELSE NULL END) AS rate,
AVG(CASE WHEN t.applicant_income_000s >= k THEN t.loan_amount_000s ELSE NULL END) AS loan
FROM k_values
LEFT JOIN my_table t ON t.applicant_income_000s >= k
GROUP BY k
)
SELECT rate, loan
FROM averages
ORDER by k;
'''
results_df = conn.execute(query).fetch_df()
Leave a comment